1 Introduction

With the widespread application of deep learning in image recognition, research on adversarial samples has prompted people to rethink the security of deep networks in practical applications. Szegedy et al. [1] first introduced adversarial samples as small perturbations that do not affect human judgment but can make a deep learning model recognize the wrong category with high confidence. Since then, research on adversarial samples has developed rapidly. Within this research, transferable adversarial examples have become an important direction due to their flexible and extensive application scenarios. Among white-box attack algorithms, the Fast Gradient Sign Method (FGSM) [2] first creates adversarial samples by superimposing the sign of the single-step loss gradient on the original image. Building on this algorithm, subsequent studies proposed the Basic Iterative Method (BIM) [3], the Momentum Iterative Method (MIM) [4], and the Projected Gradient Descent method (PGD) [5] to improve the transferability of adversarial examples, which has been continuously improved through iteration, momentum, and gradient projection. Among black-box algorithms, the Natural Evolution Strategies (NES) [6] and Simultaneous Perturbation Stochastic Approximation (SPSA) [7] algorithms improve the effectiveness of adversarial perturbations obtained from gradient estimates, reducing the number of queries needed to enhance the performance of black-box attacks.

To promote robustness, ensembles have become an essential research direction for defending against adversarial samples. Intuitively, the voting mechanism produces a wrong prediction only when a majority of sub-models converge to the same wrong prediction. Essentially, well-calibrated uncertainty estimation for adversarial samples outside the training data distribution ensures the robustness of the ensemble model [8]. Related test results in [9] introduced the concept of sub-model diversity under ensemble conditions and experimentally demonstrated a specific correlation between ensemble robustness and sub-model diversity. Through wide competition between the attack and defense sides, the diversity metric is commonly reduced to diversity of model architecture. Attackers tend to include as many different sub-model architectures as possible in an ensemble to empirically build superior black-box surrogate models. For a defender, a more architecturally diverse ensemble likewise implies stronger robustness against adversarial examples [10]. Such empirical conclusions based on sub-model architecture rely on the weak correlation between architecture and gradients. Further studies have demonstrated that models trained on the same dataset without additional constraints tend to extract the same non-robust features [11, 12], reducing the effectiveness of such an empirical defense in practice.

Further research defines the diversity between sub-models through descriptions of adversarial transferability to enhance the ensemble's robustness [13,14,15]. These methods associate sub-model transferability with three diversity hypotheses: 1. diversity of output logits' distributions [13]; 2. diversity of adversarial subspaces [14]; 3. diversified overlap of non-robust features [15]. Each hypothesis has experimentally improved ensemble robustness. Their common problem is that transferability rests on abstract hypotheses rather than an accurate mathematical definition. This paper adopts a first-order approximation of model outputs to define adversarial transferability through the target singular value whose singular vector has the smallest Wasserstein distance to the source singular vector. Through the mathematical theory behind this metric, the hypotheses above are further explained theoretically, and their shortcomings in transferability evaluation are analyzed. Geometrically, Fig. 1 demonstrates the difference between the proposed evaluation method and these methods via the level set of the gradient optimization problem. This paper further utilizes the proposed transferability metric as a regularization constraint in ensemble training to isolate the transferability between sub-models. An accurate definition of transferability also disentangles the correlation between robustness and transferability through further analysis of ensemble robustness. The fundamental novelties of this paper are listed as follows:

  1. Through the first-order approximation of the Jacobian matrix of the CNN model, the Jacobian matrix's singular value decomposition is employed to quantitatively characterize the adversarial distribution. Given the complexity of the parameters and algorithms of adversarial sample generation, such a definition has more general applicability in defense evaluation.

  2. Based on the interpretability of optimal transport theory for CNN optimization, the optimal transport distances between different adversarial distributions are employed to describe the adversarial transferability between models and to analyze the deficiencies of existing transferability metrics.

  3. This paper employs the proposed adversarial transferability metric as a regularization term in ensemble training to realize transferability isolation between sub-models. The robustness analysis demonstrates and disentangles the correlation between transferability and robustness in ensembles.

Fig. 1 The illustration of different transferability metrics based on the optimization problem's level set. (a) The transferability upper bound defined by cosine distance; (b) The Kullback–Leibler divergence (KLD) constraint on transferability through the loss function; (c) The transferability metric defined in this paper

In the following, prior transferability hypotheses are reviewed in Section 2, and the preliminaries of transfer-based attacks and the first-order analysis of the Jacobian matrix are introduced in Section 3. Then, the proposed transferability metric is described in Section 4. Finally, Section 5 gives the experiments and discussion.

2 Related work

Evaluation metric for adversarial transferability

The concept of adversarial transferability was defined as a diversity metric in the study of ensemble robustness [8]. In early practice, sub-model transferability was first described as the diversity of model architectures. However, this evaluation metric limits the improvement of ensemble robustness [10]. Subsequent studies mostly start from the hypothesized correlation between diversity and sub-model transferability to propose different evaluation metrics for ensemble robustness. Based on the model's logit outputs, transferability was evaluated through the diversity of non-maximal predictions between sub-models in Adaptive Diversity Promoting (ADP) [13]. Based on the overlap of adversarial subspaces, it was evaluated through the diversity of gradient directions between sub-models in Gradient Alignment Loss (GAL) [14]. Based on the non-robust features extracted by the model, it was evaluated through the degree of non-robust feature space overlap in Diversifying Vulnerabilities for Enhanced Robust Generation of Ensembles (DVERGE) [15]. Unlike the assumptions above, this paper gives a first-order approximate expression of the adversarial distribution through optimization theory, and the characterization of transferability through optimal transport theory moves the definition of transferability toward a mathematical analysis of model attributes. Being quantitative, the metric can serve as a regularization constraint during training.

Model attribute analysis based on the Jacobian

With the growth of CNN technology, scholars are increasingly inclined to reveal the model's black-box attributes through theoretical analysis, and Jacobian matrix-based analysis has received growing attention. On the one hand, the Jacobian matrix's Frobenius norm is utilized to regularize the model's robustness during training [16, 17]. When adversarial samples were first discovered, the spectral norm of a given layer's weights was considered a metric for evaluating the model's sensitivity [1]. This spectral norm has also been studied as a constraint on the generalization performance of CNNs [18], and even to promote the efficiency of generative adversarial networks (GAN) in generating diverse pictures [19, 20]. The weight matrix of a particular layer is the Jacobian of the feature extraction that layer performs. Building on model attribute extraction through the spectral norm, the Jacobian matrix's global spectral norm has been further utilized to constrain the model's robustness [21, 22].

Additionally, in contrast to the robustness problem, research on black-box attack algorithms [23, 24] adopted the eigendecomposition of the Jacobian matrix to improve the transferability of adversarial perturbations from a single gradient estimate, reflecting that the eigenvectors are more essential quantities for evaluating gradient changes. In [24, 25], it was revealed that the iterative generation procedure of adversarial samples is a first-order approximation to the maximum singular vector of the Jacobian matrix through the power method [26]. Thus, adversarial training is equivalent to spectral norm regularization of the Jacobian matrix. Based on these studies, the Jacobian matrix's Frobenius norm was connected with the transferability of Universal Adversarial Perturbations (UAP) through further mathematical analysis [27]. It can be theoretically proved that the variation of the upper bound under transfer attack is quantified by the cosine distance between the Jacobian matrices when the singular vectors are fully aligned.

Most existing robust optimization approaches for the Jacobian matrix are based on solving its norm. Although the norm computation skips the complete singular value decomposition and thereby simplifies the calculation, it cannot fully describe the distribution of adversarial samples. This paper converts the high-dimensional Jacobian matrix into a two-dimensional form for batched singular value decomposition, which serves as a simplified expression of the adversarial distribution, without generating adversarial samples, to evaluate transferability under the Jacobian matrix.

3 Background

3.1 Transfer-based black-box attacks algorithm

The white-box attack assumes that the model's classification is approximately linear in the high-dimensional feature space. Thus, the gradient of the loss function reflects the optimal direction toward the classification boundary. Under such a linear assumption, gradient-based white-box adversarial attack algorithms were developed. The FGSM algorithm was proposed under the single-step gradient direction as (1):

$$ x^{\prime}=x+\epsilon \cdot \operatorname{sign}(\nabla_{x}J(f,x,y)) $$
(1)

where \(x^{\prime }\) is the adversarial example, and ∇xJ(f,x,y) is the gradient of the loss function calculated under model f, image x, and label y.
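For concreteness, (1) translates directly into a few lines of PyTorch. The sketch below is a minimal illustration, assuming a classifier `model` returning logits, cross-entropy as the loss J, and pixel values in [0, 1]:

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """Single-step FGSM as in (1): x' = x + eps * sign(grad_x J(f, x, y))."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)   # J(f, x, y)
    loss.backward()
    x_adv = x + eps * x.grad.sign()
    return x_adv.clamp(0, 1).detach()     # keep pixels in the valid range (assumed [0, 1])
```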

This paper considers the attack transferability of adversarial examples generated from a surrogate model to a target model. Although the attacker knows neither the parameters nor the gradient information of the target model, it can obtain a surrogate model trained on the same dataset. The adversarial sample generated from the surrogate model with a gradient-based algorithm thus makes the scenario a transfer-based black-box attack. To simultaneously improve the attack performance of adversarial examples on the surrogate model and their transferability to the target model, iterative gradient-based methods have been studied, described with the following equation:

$$ x_{t+1}=\underset{X+s}{\Pi}\left( x_{t}+\epsilon \cdot \operatorname{sign}\left( \nabla_{x} J(f,x_{t},y)\right)\right) $$
(2)

where \(\Pi _{X+s}\) represents the constraint on the perturbation after iteration t. Among various iterative attack algorithms, the PGD method can improve adversarial transferability. The PGD method has been widely utilized in robustness evaluation and adversarial training due to its excellent attack performance. In PGD, \(\Pi _{X+s}\) projects the perturbation back onto the allowed sphere, combined with random initialization of the perturbation to improve transferability. In the following experiments, the adversarial distillation of DVERGE and the attack-based robustness analysis are performed with the PGD method.
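A minimal L2-PGD sketch matching the evaluation settings used later in this paper (50 iterations, step size eps/5, random initialization) is given below; the Linf variant in (2) would use the sign of the gradient instead of its L2 normalization. The helper is illustrative, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def pgd_l2(model, x, y, eps, steps=50):
    """Untargeted L2 PGD: random start, normalized gradient ascent, projection onto the eps-ball."""
    alpha = eps / 5
    delta = torch.randn_like(x)                               # random initialization
    delta = eps * delta / delta.flatten(1).norm(dim=1).view(-1, 1, 1, 1)
    for _ in range(steps):
        delta.requires_grad_(True)
        loss = F.cross_entropy(model(x + delta), y)
        grad = torch.autograd.grad(loss, delta)[0]
        g = grad.flatten(1).norm(dim=1).view(-1, 1, 1, 1) + 1e-12
        delta = (delta + alpha * grad / g).detach()
        d = delta.flatten(1).norm(dim=1).view(-1, 1, 1, 1)
        delta = delta * (eps / d).clamp(max=1.0)              # projection Pi_{X+s}
    return (x + delta).clamp(0, 1)
```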

3.2 Robust first-order analysis based on the Jacobian matrix

Define f(x) as the logit output of a convolutional neural network f under image x, while \(J_{f}(x) = \left (\frac {\partial f_{i}}{\partial x}\right )\mid _{x}\) is the Jacobian matrix under image x. When the disturbance δ is small enough, the model's output f can be linearly represented as the first-order Taylor series expansion through the Jacobian matrix Jf(x) by ignoring the higher-order terms:

$$ f(x+\delta) \simeq f(x)+J_{f}(x) \delta $$
(3)

The Lq norm measures the degree of variation of the model's output, and the Cauchy–Schwarz inequality gives an upper bound on the output variation:

$$ \left\| f(x+\delta)-f(x) \right\| \approx \left\| J_{f}(x)\delta \right\|_{q}\leq \left\| J_{f}(x) \right\|_{F}\left\| \delta \right\| $$
(4)

According to the above mathematical analysis, the Jacobian matrix's Frobenius norm defines an upper bound on a single model's robustness. By converting the Jacobian matrix to a stacked Jacobian matrix \(\bar {J}_{N}\), the upper bound in (4) can also define a theoretical upper bound of the transfer attack, which is described with the cosine distance between the Jacobian matrices in [14, 27] as GAL:

$$ \cos \left( J_{i}, J_{j}\right)_{i \neq j} = \frac{\left\langle J_{i}(x), J_{j}(x)\right\rangle}{\left\|J_{i}(x)\right\|_{F}\left\|J_{j}(x)\right\|_{F}} \leq 1 $$
(5)

The upper bound in (5) is attained when different Jacobian matrices share their corresponding singular vectors and their singular values agree up to a fixed scalar. Through the relationship between the adversarial perturbation direction and non-maximal predictions, ADP [13] defines the diversity of adversarial perturbation gradients through the divergence of the non-maximal predictions. Understanding adversarial transferability from a distance perspective, ADP intuitively characterizes the KLD between the optimal adversarial perturbation gradients through the classification loss.

As shown in Fig. 1, there are problems with these two definitions of transferability: (1) The Frobenius norm only describes the upper-bound requirement on the degree of alignment [14, 27]. Since minimizing this metric during training is an indirect way to isolate transferability, it cannot accurately characterize transferability under incomplete alignment; (2) ADP achieves the KLD constraint between the optimal perturbations through the classification loss. However, adversarial transferability does not necessarily occur between the optimal perturbations: transferability can still exist when a sub-optimal perturbation at a local minimum differs little from the optimal one in the model's output yet has a closer KLD. Taking these two issues as the starting point, this article defines an evaluation metric in the next section for measuring the transferability of adversarial samples and optimizes this metric as a regularization term in network training to diversify sub-models, which finally isolates the transferability between sub-models.

4 Method

4.1 Preliminaries of SVD and transferability

From the perspective of optimization theory, the adversarial sample optimization in (4) aims to maximize \(\left \|J_{f}(x) \delta \right \|_{q}\). When q = 2, the goal of adversarial sample optimization can be simplified to a constrained optimization problem over quadratic forms:

$$ \begin{array}{@{}rl@{}} \underset{\delta}{\text{maximize}} & \delta^{T} Q \delta \\ \text{subject to} & \delta^{T} P \delta = k \end{array} $$
(6)

where Q = J^TJ. Since the norm is homogeneous, k is set to 1 to solve the optimization problem (6). Given the Lagrangian function \(l(\delta, \lambda )=\delta^{T} Q \delta+\lambda \left (1-\delta^{T} P \delta\right )\) for the constrained optimization problem (6), the following Lagrange condition can be obtained:

$$ P^{-1} Q \delta = \lambda \delta $$
(7)

Therefore, the eigenvectors of P− 1Q are the candidate optimal solutions of the objective function (6). When the perturbation constraint is also under the L2 norm, P is the identity matrix, and the maximum eigenvalue of Q is the maximum value of the cost function (6). It can be seen that the singular vectors of the Jacobian matrix J essentially define the possible locally optimal solutions of δ. Thus, the entire singular matrix describes the model's adversarial distribution. The singular value corresponding to each singular vector reflects the output disturbance that the corresponding adversarial direction can generate, while the maximum singular value defines the maximum output variation of the model under the L2 norm. This also reveals why the spectral norm is a more stringent constraint than the Frobenius norm for a single model's robustness. Each locally optimal perturbation is approximated under the first-order condition through the singular value decomposition of the Jacobian matrix. The transferability of adversarial examples thus essentially becomes the degree of alignment between the singular vectors of different Jacobian matrices [25].
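Under the L2 setting (P is the identity), this analysis can be checked numerically. The sketch below, assuming a small model and a single image (the full Jacobian is expensive for large inputs), recovers the adversarial distribution of (6)-(7) via SVD:

```python
import torch
from torch.autograd.functional import jacobian

def adversarial_distribution(model, x):
    """SVD of the Jacobian at x: the right singular vectors are the candidate
    locally optimal perturbation directions delta of (6)-(7), and each singular
    value is the output variation that direction produces under the L2 norm."""
    J = jacobian(lambda inp: model(inp.unsqueeze(0)).squeeze(0), x)  # [num_classes, C, H, W]
    J = J.flatten(1)                                                 # [num_classes, C*H*W]
    _, S, Vh = torch.linalg.svd(J, full_matrices=False)
    return S, Vh   # S[0] is the spectral norm; Vh[0] is the worst-case L2 direction
```

Reshaping `Vh[0]` back to the image shape gives the first-order worst-case perturbation direction.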

In the DVERGE method [15], the degree of alignment is defined by (8) as the adversarial loss of adversarial examples distilled between sub-models. Transferability is explicitly defined by the loss of adversarial examples under the target model. \(x_{f_{i}}^{\prime }\) is the ys-targeted adversarial example distilled from the source model fi and evaluated on the target model fj (i ≠ j), based on image x. Due to the randomness of the image input pairs (x, y) and (xs, ys), the overall transferability metric through the loss function is expressed as an expectation.

$$ \begin{array}{@{}rcl@{}} d\left( f_{i}, f_{j}\right)_{DVE}&=&\frac{1}{2} \mathrm{E}_{(x, y),\left( x_{s}, y_{s}\right)}\left[l_{f_{j}}\left( x_{f_{i}}^{\prime}\left( x, x_{s}\right), y\right)\right.\\ &&\left.+l_{f_{i}}\left( x_{f_{j}}^{\prime}\left( x, x_{s}\right), y\right)\right] \end{array} $$
(8)

By minimizing this adversarial loss, the transferability of the largest singular vectors from the source model to the target model can be explicitly constrained, but the effect depends on the adversarial sample generation algorithm and its parameters. The starting point and main innovation of this paper is therefore: how to evaluate the degree of alignment more reasonably, without relying on prior information about adversarial samples, and thereby achieve a widely applicable evaluation of transferability.

This paper accurately defines adversarial transferability through the adversarial distribution characterized by the Jacobian matrix's singular value decomposition. As shown in Fig. 2, a parameter update based on the classification loss function can be considered a constraint on the KLD between two distributions. Wasserstein GAN (WGAN) [28] indicated that constraining the variation between the logit outputs of different images is essentially a constraint on the Wasserstein distance. Transferred to the adversarial setting, the transferability metric defined through adversarial samples in DVERGE [15] can similarly be expressed as a GAN-style discrimination constraint that distinguishes adversarial examples coming from other sub-models. Without a loss function, the Wasserstein distance is the more efficient optimization objective between two distributions in optimal transport theory [28]. More experiments on other distances can be found in Section 5.1.

Fig. 2 The transferability is analyzed by the optimization level set based on the SVD and adversarial distribution. (a) DVERGE illustrates transferability through the target singular value with the smallest KLD; (b) The quantitative metric of transferability based on the singular matrix with the smallest Wasserstein distance

4.2 Transferability metric based on Wasserstein distance

Based on the above characterization of the adversarial distribution and of distances related to transferability, a more accurate assessment of transferability can be defined: given the singular vector corresponding to the maximum singular value of the source Jacobian matrix, the singular value corresponding to the target Jacobian singular vector that minimizes the Wasserstein distance reveals the approximate output variation under a transferred attack.

The following equation describes this transferability in terms of the singular vectors (s_vec) and the singular values (s_val):

$$ Metric\_trans_{S \to T} = \frac{\underset{s\_val_{{J_{T}}}}{\arg\max} P\left( \max \left( s\_vec_{J_{S}}\right) \rightarrow s\_vec_{J_{T}}\right)}{\underset{C}{\arg\max} P\left( \max \left( s\_vec_{J_{S}}\right) \rightarrow s\_vec_{J_{T}}\right)} $$
(9)

where the subscripts S and T stand for the source and target models, respectively. P is the distribution transfer matrix obtained by calculating the Wasserstein distance, while C is the corresponding transport cost factor. Algorithm 1 shows the flow of the method in this section and focuses on the dimensional changes of the Jacobian matrix through the SVD solution process.

Algorithm 1 Transferability metric based on Jacobian matrix singular value decomposition

The CNN model maps high-dimensional data input to low-dimensional feature vectors. After projection along the loss gradient direction, the Jacobian matrix still has dimensions [batch-size, image-channel, image-size, image-size]. Among the available multi-way SVD methods [29], Higher-Order SVD (HOSVD) reveals universal adversarial samples when the Jacobian matrix is flattened along the batch-size dimension. Since adversarial transferability is defined as misclassification across different models under the same perturbation and sample, the calculation of the transferability metric should be sample-centric across models. HOSVD is therefore not a sample-centric transferability evaluation method, and it incurs a high computational cost from high-dimensional operations and constraints. Similarly, the Two-Way SVD on Average of X method ignores the sample-centric definition of transferability, and the singular values and singular vectors calculated from Canonical Polyadic Decomposition lack a sample-to-sample correspondence. With Replicated SVD, the singular value matrix under different samples can be obtained, and the dimension of the singular value matrix is compatible with the image size, reducing the computational complexity and GPU memory footprint. To obtain a sample-centric singular value decomposition at an acceptable computational cost for the high-dimensional matrix, the Jacobian matrix is merged after loss projection along the batch-size and image-channel dimensions. Finally, 2D batched SVD is achieved through Replicated SVD in the dimensions [image-size, image-size]. This is also the main computational difference from methods that depend on adversarial sample generation: although the proposed method does not generate adversarial samples, it still yields a numerical evaluation of transferability.
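As one concrete reading of (9) and Algorithm 1 (which is not reproduced here), the sketch below operates on two per-sample 2D Jacobians of shape [image-size, image-size] obtained after the loss projection and dimension merging described above. The normalization by the transport cost C is an assumption of this sketch, and `scipy.stats.wasserstein_distance` stands in for the paper's optimal transport computation:

```python
import torch
from scipy.stats import wasserstein_distance

def metric_trans(J_src, J_tgt):
    """Hedged sketch of (9): match the top source singular vector to the target
    singular vector with the smallest Wasserstein distance, then report the
    associated target singular value normalized by the transport cost."""
    _, S_s, Vh_s = torch.linalg.svd(J_src)
    _, S_t, Vh_t = torch.linalg.svd(J_tgt)
    src = Vh_s[0].abs().detach().numpy()          # singular vector of the largest source singular value
    dists = [wasserstein_distance(src, v.abs().detach().numpy()) for v in Vh_t]
    k = min(range(len(dists)), key=dists.__getitem__)
    return float(S_t[k]) / (dists[k] + 1e-12)     # matched singular value over transport cost C (assumed)
```

Note that this NumPy-based version is evaluation-only; a training-time regularizer would need a differentiable transport distance.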

4.3 Training routine of ensemble based on transferability metrics

After calculating the transferability metric between the different sub-models, the obtained metric is added to the traditional classification loss as a joint objective for ensemble training. After obtaining the constrained loss of each sub-model, this section optimizes the parameters using the ensemble as a whole model. Equation (10) shows the overall loss function for the optimization problem.

$$ \text{ensemble\_loss} = \lambda \sum\limits_{i=1}^{N} l_{\text{class}}\left( f_{i}(x), y\right)+\sum\limits_{i=1}^{N} \sum\limits_{j \neq i} \text{Metric\_trans}_{i,j} $$
(10)

Algorithm 2 shows the flow of the overall optimization process. It indicates the practical importance of L2 regularization of the Jacobian matrix and of gradient clipping while updating parameters. Under these conditions, the validity of the Wasserstein distance solution can be guaranteed under the first-order approximation.

Algorithm 2 Ensemble network optimization based on transferability metric
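A minimal sketch of one update step of Algorithm 2 and the joint objective (10) is shown below. `metric_trans_pair` is a hypothetical differentiable stand-in for the pairwise metric of Section 4.2, and the clipping threshold is an assumption:

```python
import torch
import torch.nn.functional as F

def train_step(sub_models, optimizer, x, y, lam, metric_trans_pair):
    """One optimization step of the joint loss (10) over the whole ensemble."""
    optimizer.zero_grad()
    class_loss = sum(F.cross_entropy(f(x), y) for f in sub_models)
    trans_loss = sum(metric_trans_pair(f_i, f_j, x, y)          # hypothetical helper (Section 4.2)
                     for i, f_i in enumerate(sub_models)
                     for j, f_j in enumerate(sub_models) if i != j)
    loss = lam * class_loss + trans_loss
    loss.backward()
    params = [p for f in sub_models for p in f.parameters()]
    torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)        # gradient clipping (Algorithm 2); threshold assumed
    optimizer.step()
    return loss.item()
```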

Different experimental settings are all trained in parallel on 4 GPUs. The batch size and the initial learning rate are 100 and 0.001, respectively, and the sub-model architectures are all ResNet-20 [30]. The experiments confirm the same conclusion as DVERGE during training: an Adam optimizer obtains better transferability isolation when training from scratch. It should be emphasized that the parameter update in this paper does not depend on any adversarial sample generation process, so there is no random factor from the random initialization of adversarial samples to improve diversity. The transferability constraints on the ensemble model are derived only from the evaluation metric.

5 Experiment and results

The experiments were first conducted on the CIFAR-10 dataset [31]. The benchmark model (Baseline) is an ensemble network with ResNet-20 as the sub-model architecture, where each sub-model employs the cross-entropy loss as its optimization goal. The batch size was set to 128, and the learning rate to 0.001. The Adam optimizer is employed for 200 epochs of model training, and the learning rate decays by a factor of 0.1 at epochs 100 and 150. Taking the baseline model as the benchmark, the comparison models ADP, GAL, and DVERGE are obtained under the same parameter conditions and model architecture with different transferability constraints in the loss function. Among the different constraint methods, the ensemble models of Baseline and Adversarial training [32] set the optimizer and loss function of each sub-model to train independently to ensure the sub-models' independence. The sub-model architecture, training parameters, and optimizer for the proposed transferability metric are the same as the benchmark setting. Under the same training parameters, the following sections present a fair and effective comparison of the models trained with different constraints.

5.1 Transferability evaluation of adversarial distributions with different distance functions

In this part, under the theoretical characterization of the adversarial distribution by SVD, the influence of different distance functions in the denominator of (9) on the performance of the transferability metric is discussed to demonstrate the advantage of the Wasserstein distance. Following the isolation effects of DVERGE (DVE) [15], GAL [14], ADP [13], and Baseline (Base) on transferability in previous studies, this paper employs the attack success rate of adversarial samples between sub-models as the gold standard for the transferability metric. The degree of consistency between the metric under different distance measures and the gold standard demonstrates the effectiveness of the Wasserstein distance (Wass) in transferability evaluation. For comparison, distance metrics commonly used in machine learning are chosen, including the Euclidean distance (L2), the Cosine distance (Cos), and the Maximum Mean Discrepancy (MMD) applied in domain transfer problems.

Regarding the specific details of this part, the experiment is performed on the same set of 2000 CIFAR-10 test images. For each sample, the transferability metric of a sub-model against the other sub-models is obtained as described in Section 4.2, merging the batch-size and image-channel dimensions, and the average over all sub-models and samples is finally given as the evaluation result for the ensemble model. Tables 1, 2 and 3 show the contrastive experiment results with 3, 5, and 8 sub-models, respectively. The best transferability-isolated model evaluated by the corresponding metric is marked with black font.

Table 1 Contrastive experiments on transferability evaluation through different distance metrics under three sub-models. The best transferability evaluated by the corresponding metric is marked with black font
Table 2 Contrastive experiments on transferability evaluation through different distance metrics under five sub-models. The best transferability evaluated by the corresponding metric is marked with black font
Table 3 Contrastive experiments on transferability evaluation through different distance metrics under eight sub-models. The best transferability evaluated by the corresponding metric is marked with black font

It can be seen from the results that, under the gold-standard transferability metric, the transferability isolation effect from high to low is DVERGE, GAL, ADP, and Baseline, respectively. Evaluating these models under the different distance measures, the transferability results under the Cosine distance deviate too far from the gold standard to serve as a metric. The evaluation results of the other distances agree on the best-isolated model. However, the MMD and Euclidean distances differ from the gold standard in the remaining rankings. Overall, the transferability evaluations with the Wasserstein distance are the most consistent with the actual attack results, demonstrating the validity of (9) in evaluating transferability under the Wasserstein distance.

5.2 Experiment of transferability evaluation between sub-models

The experiments in this section use different transferability metrics to cross-evaluate the different transferability-isolated ensemble models, in order to study the correlation between the metrics and demonstrate their effectiveness. The experiment contains four evaluation metrics: the metrics in DVERGE, GAL, and this paper, and the success rate of adversarial transfer attacks. In addition to the models obtained with the above three metrics as constraints, this experiment further includes the Baseline without any constraints, the ADP model, the adversarial training (AT) model, and the DVERGE model combined with adversarial training (DVE+AT) for comparative analysis. The attacker employs the PGD untargeted attack to generate adversarial samples on each sub-model and attack the other sub-models in the ensemble. The gold standard for evaluation is the average success rate of transfer attacks from a single sub-model against all other sub-models. The experiments report results under different disturbance constraints to show their consistency. All experiments are performed on the CIFAR-10 dataset. The metric in DVERGE through the classification loss is defined in (8). The metric in GAL is defined by (11) as follows:

$$ d\left( f_{i}, f_{j}\right)_{GAL} = \log \left( \sum\limits_{1\leq a < b\leq N} \exp \left( \frac{\left\langle J_{a}, J_{b}\right\rangle}{\left\|J_{a}\right\|_{F}\left\|J_{b}\right\|_{F}}\right)\right) $$
(11)
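For reference, (11) amounts to a log-sum-exp over the pairwise cosine similarities of flattened Jacobians; a minimal sketch:

```python
import torch

def gal_diversity(jacobians):
    """GAL term in (11): log-sum-exp of pairwise cosine similarities <J_a, J_b>."""
    flat = [J.flatten() for J in jacobians]
    sims = [torch.dot(flat[a], flat[b]) / (flat[a].norm() * flat[b].norm())
            for a in range(len(flat)) for b in range(a + 1, len(flat))]
    return torch.logsumexp(torch.stack(sims), dim=0)
```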

Equation (8) depicts, through the recognition loss under random target categories, the degree of output variation caused by adversarial examples distilled from one sub-model and applied to the others. The higher this diversity metric, the smaller the adversarial sample transfer between sub-models. Conversely, (9) and (11) are directly proportional to transferability under numerical quantification. For comparison, this section gives the average of the different metrics over 2000 test samples under different numbers of ensemble sub-models. Tables 4, 5 and 6 show the results under different numbers of ensemble sub-models. The top two methods in transferability evaluated under each metric are marked in black. The percentage behind each model represents its accuracy on clean test samples.

Table 4 Cross-evaluation of different transferability-isolated models under three sub-models. The top two methods of transferability evaluated under each metric are marked in black
Table 5 Cross-evaluation of different transferability-isolated models under five sub-models. The top two methods of transferability evaluated under each metric are marked in black
Table 6 Cross-evaluation of different transferability-isolated models under eight sub-models. The top two methods of transferability evaluated under each metric are marked in black

Observing the proportion of top-two results occupied by each method across the different ensemble models, the ensemble model obtained with the proposed metric shows evaluation consistency superior to the other metrics. A relatively effective transferability isolation effect is obtained under the gold standard for both L2 and Linf constraints, reflecting an effective transferability evaluation metric. It is worth noting that, as the number of sub-models increases, the transferability-constraint-based methods fall behind adversarial training, indicating that taking transferability isolation as the optimization goal hits a bottleneck as the number of sub-models grows. Although adversarial training imposes no constraint on transferability, it performs well at isolating it. The relevant reasons can be analyzed in two aspects: (a) the robustness that adversarial training brings to a single model is also reflected in the transfer attack success rate; (b) the random initialization of the adversarial sample algorithm brings more diverse parameter update directions to model training. Further discussion is given in the subsequent analysis of transferability and robustness.

Excluding the evaluation results related to adversarial training, the proposed method achieves a transfer isolation effect closest to that of DVERGE, which defines transferability explicitly through adversarial examples. Since no adversarial samples are generated, the proposed metric is free of the randomness from the initialization and attack categories of adversarial samples, making the quantification more accurate; it also lacks the gradient update diversification brought by that randomness during optimization. Because the evaluation of adversarial transferability in the presented method is unrelated to the complexity of the adversarial sample algorithm and its parameters, the metric has greater general applicability in defense evaluation.

5.3 Experiment of ensemble robustness

After obtaining the transferability isolation performance of the sub-models, this section aims to demonstrate the relation between sub-model transferability and ensemble robustness through adversarial robustness analysis. The ensemble robustness experiments are categorized into white-box and black-box attacks. Under the white-box attack, the experiments evaluate the PGD method under different L2 perturbation limits. PGD generates untargeted adversarial perturbations using 50 iterations with an eps/5 iteration step size and 5 random initializations. The ensemble robustness of the models trained under the different transferability constraints is shown in Fig. 3(a)-(c). The horizontal axis stands for the attack perturbation, while the vertical axis stands for recognition accuracy.
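The white-box curves of Fig. 3(a)-(c) correspond to sweeping the perturbation budget. A sketch of that evaluation loop, reusing the `pgd_l2` helper sketched in Section 3.1 (simplified to one random restart):

```python
def robust_accuracy(model, loader, eps_list):
    """Accuracy of an ensemble (callable returning averaged logits) under L2 PGD
    for each perturbation budget; yields the accuracy-vs-eps curves of Fig. 3."""
    curve = {}
    for eps in eps_list:
        correct = total = 0
        for x, y in loader:
            x_adv = pgd_l2(model, x, y, eps, steps=50)
            correct += (model(x_adv).argmax(dim=1) == y).sum().item()
            total += y.numel()
        curve[eps] = correct / total
    return curve
```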

Fig. 3 Robustness evaluation results for different perturbations. (a)-(c) results of the PGD white-box attack under the L2 norm with 3, 5, and 8 sub-models in the ensemble, respectively; (d)-(f) results of the black-box attack using the baseline model as the surrogate with 3, 5, and 8 sub-models in the ensemble, respectively

As shown in Fig. 3(a)-(c), DVERGE's ensemble model provides the best robustness among all methods under the white-box attack. DVERGE combined with adversarial training has the worst clean accuracy but the smallest drop in accuracy under the white-box attack, indicating the best robustness. Compared with the Baseline, ADP shows no noticeable improvement in robustness; GAL cannot provide better robustness than ADP under small disturbances. However, the robustness of GAL exceeds that of ADP as the perturbation increases, ranking second only to DVERGE. In contrast, although the proposed method achieves better transferability isolation than GAL and ADP, it provides the worst robustness under white-box attacks. This conclusion differs from previous conclusions on ensemble robustness and transferability. Through the theoretical analysis of singular value decomposition in this paper, this difference may arise because the previous methods implicitly improve the sub-models' robustness.

According to (9), transferability is mainly characterized by the target singular value with the smallest distance to the source singular vector. Considering the robustness constraint on the largest singular value, reducing the largest singular value (improving robustness) will in any case indirectly decrease the singular value with the closest distance, thereby achieving transferability isolation. The GAL method takes transferability as its starting point; however, due to the upper-bound constraint condition, GAL essentially constrains the singular values during optimization without considering the mutual distance between singular vectors. GAL tends to decrease all models' singular values through sub-model interaction constraints in the optimization process. The transferability upper bound proposed by GAL attains a small value when the Frobenius norm of each sub-model's Jacobian matrix is small, i.e., when the sub-models are relatively robust through (4). Although the DVERGE method explicitly defines adversarial transferability, the number of iterations, the iteration step size, and the initialization direction affect the convergence rate of the adversarial samples from the gradient descent perspective of adversarial optimization. The small number of iterations (DVERGE sets it to 10) and the random initialization conditions (random initialization of adversarial samples and randomization of adversarial classes) in the generation process may cause incomplete convergence of the distilled adversarial examples and, to some extent, achieve adversarial training on the other sub-models. This also explains why different iteration step sizes affect the robustness of the ensemble under the DVERGE method.

The metric in this paper does not constrain the maximum singular value of the sub-models, which is an important aspect distinguishing the proposed method from the others. Looking only at the success rate of adversarial transfer attacks, adversarial training also achieves a good transfer isolation effect, but this effect is based on an overall reduction of the singular values. Even though DVERGE achieves excellent transferability isolation, its best robust performance is still obtained by adding adversarial training. Based on Tables 4, 5 and 6, DVERGE with adversarial training shows a decline in transferability isolation. This shows, from another perspective, that there is no inherent correlation between ensemble robustness and transferability. The transferability metric adopted in this paper accurately disentangles the correlation between transferability and robustness in ensembles: although good robustness can achieve better transferability isolation, transferability isolation cannot achieve robustness, even under ensemble conditions.

The black-box experiments use transferable adversarial samples generated by different attack methods with the baseline model as the surrogate, following DVERGE. Three types of attack methods, PGD [5], M-DI2-FGSM [33], and SGM [34], are employed, and the final accuracy is computed comprehensively over the different types of adversarial samples. For each sample, 30 adversarial counterparts are collected for black-box testing according to the permutations and combinations of the loss function, the number of surrogate models, and the attack methods. In terms of evaluation criteria, following DVERGE's settings, the model is deemed robust on a sample only if all 30 types of adversarial counterparts are correctly identified. Figure 3(d)-(f) presents the black-box attack evaluation results. Owing to transferability isolation against more transfer attacks, the proposed method provides better results than the Baseline and ADP. However, it remains inferior to the DVERGE and GAL methods in terms of robustness.

The above black-box experiments show that transferability isolation provides relatively limited black-box robustness. To further clarify the advantages of transferability isolation, this section assumes different attack scenarios to explore the key role of transferability in adversarial example defense. When the model parameters are treated as protected private data, the attacker has limited knowledge of the sub-models. The experiment evaluates the influence of partial model leakage on the robustness of the ensemble. The ensemble is attacked with untargeted adversarial samples generated on each sub-model under white-box conditions, and the average attack success rate over the different sub-models is taken as the robustness result. The attack adopts PGD under the L2 constraint with 50 iterations, an eps/5 iteration step size, and 5 random initializations. Table 7 presents the relevant results, with the best results marked in red.

Table 7 Robustness analysis of the ensemble model in the case of leakage of a single sub-model

Based on Table 7, the ensemble model that optimizes the transferability metric of this paper shows good robustness under the white-box attack caused by the leakage of some sub-model parameters, and this robustness increases with the number of sub-models. This shows that the contribution of transferability isolation to robustness lies not in traditional white-box or black-box attack scenarios but in white-box transfer attacks caused by the leakage of some model parameters. Transferability isolation without constraining the maximum singular value steers the white-box attacker's adversarial example generation in directions that do not transfer to the remaining sub-models, reducing the impact of model data leakage on overall robustness.

5.4 Robust radius analysis based on the decision boundary

The above contradictory correlation results were found under different experimental settings. This section combines them with the analysis of singular values to disentangle the correlation between robustness and transferability. In the previous analysis, the attack success rate over many samples served as the common gold standard of both transferability and robustness. Based on the results in Tables 4, 5 and 6 alone, it is impossible to judge which factors influence the attack success rate. This also leads to inconsistencies between the different metrics and the gold standard in the comparison, which is not conducive to concluding the disentanglement of transferability and robustness. To further illustrate their correlation and support our conclusion, this section uses the robust radius represented by the decision boundary of a single sample for a complete demonstration.

The decision boundary sample points lie within a 2D plane spanned by an adversarial direction and a random Rademacher vector around a test image. The classification output of these sampled points is then evaluated and visualized. The model prediction under different perturbations is drawn with the gradient direction obtained from the sub-model's loss function as the vertical axis and the random Rademacher direction as the horizontal axis. Different colors represent different categories. Figure 4 presents the relevant results for the 3 sub-models of the ensemble.
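A sketch of this visualization, assuming a single image `x` of shape [C, H, W] and a precomputed loss-gradient direction `adv_dir` of the same shape; the grid extent and resolution are illustrative:

```python
import torch
import matplotlib.pyplot as plt

def decision_plane(model, x, adv_dir, extent=2.0, n=51):
    """Predicted classes on the plane spanned by the (normalized) gradient
    direction and a random Rademacher direction around a test image."""
    rad = (torch.randint(0, 2, x.shape).float() * 2 - 1)   # random Rademacher vector
    rad, adv = rad / rad.norm(), adv_dir / adv_dir.norm()
    ts = torch.linspace(-extent, extent, n)
    preds = torch.zeros(n, n, dtype=torch.long)
    with torch.no_grad():
        for i, a in enumerate(ts):          # gradient (vertical) axis
            for j, r in enumerate(ts):      # Rademacher (horizontal) axis
                preds[i, j] = model((x + a * adv + r * rad).unsqueeze(0)).argmax(dim=1).item()
    plt.imshow(preds.numpy(), origin="lower", extent=[-extent, extent, -extent, extent])
    plt.xlabel("Rademacher direction"); plt.ylabel("gradient direction")
    plt.show()
```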

Fig. 4 Decision boundaries on the same sample for 3 sub-models under different transferability constraints

Taking the baseline model as the robustness benchmark, it can be seen that the baseline has a certain robustness in both the sub-models and the final ensemble, rather than being trivially attackable. AT significantly expands the robust region (blue area in the graph) in both the sub-models and the ensemble compared with the baseline model. Consistent with the analysis in the previous section, although the GAL and DVERGE methods take transferability as their starting point, they still promote the model's robustness to some extent; this improvement is reflected in the decision boundary as an expansion of the blue area relative to the baseline. In contrast, due to the lack of a constraint on the largest singular value, the transferability-isolated model under the proposed method even shows weaker robustness than the baseline. These robust-radius evaluations agree with the robustness experiments under white-box attacks. They also confirm that the improvement in ensemble robustness in previous research is partly due to an imprecise definition of transferability, which makes the constraints tend to optimize the robustness of the sub-models. This further confirms our disentanglement conclusion from the previous section: the isolation of transferability cannot by itself achieve ensemble robustness.

5.5 Transferability contrastive analysis under ImageNet

To evaluate the effectiveness of the proposed method on different benchmark datasets, this section conducts experiments with the different transferability constraint methods on the ImageNet dataset [35]. This section reproduces the transferability constraints of DVERGE, ADP, GAL, Baseline, and AT under ImageNet. The benchmark model is an ensemble with 3 ResNet-50 sub-models, where each sub-model employs the cross-entropy loss as its optimization goal. The optimizer is SGD with 100 epochs. Among the constraint methods, the ensemble models of Baseline and AT set the optimizer and loss function of each sub-model to train independently to ensure the sub-models' independence. The other constraint methods are applied with a unified optimizer and a joint loss function to ensure that adversarial transferability is not affected by random factors in the optimization process. For the training hyperparameters, the batch size is 256 and the learning rate is 0.1, reduced by a factor of 0.5 every 10 epochs. Due to the significant memory footprint of the Jacobian matrix, the proposed method sets the batch size to 128; the other training hyperparameters and the sub-models of the ensemble are identical to the other methods. All models are trained with parallel acceleration on 8 RTX 3060 GPUs.

After obtaining models under the different loss function constraints on the ImageNet dataset, experiments are performed on sub-model transferability and on ensemble robustness under single-model leakage, to detect differences from the conclusions on the small-scale dataset. As discussed in [5], for transfer-based attacks on models with significant capacity, the stronger the attack, the worse it generalizes as a transferable adversarial sample. Therefore, this section evaluates transferability using the FGSM attack algorithm under the L2 constraint. The evaluation criteria follow Sections 5.2 and 5.3, with accuracy and attack success rate reported as top-1 accuracy. Table 8 shows the relevant results.

Table 8 Experiments on transferability and robustness on the ImageNet with FGSM attack under L2 constraint

Analyzing the adversarial transferability, the Baseline can achieve transferability isolation under FGSM attacks through random initialization of sub-models and independent SGD optimization. This illustrates the negative correlation of network capacity and attack capability with adversarial transferability, and indicates the significant effect of randomness on transferability isolation under a large-scale dataset. Compared with the Baseline, DVERGE, GAL, ADP, and AT each provide a certain improvement in transferability isolation. Unlike the conclusion under CIFAR-10, the transferability isolation effect of ADP is more effective than that of GAL, reflecting the specific performance boundaries of first-order analysis on ImageNet. Furthermore, the proposed method achieves better transferability isolation than ADP and GAL within a small perturbation range. However, this conclusion does not hold for larger perturbations, indicating that the effectiveness of the first-order analysis is limited by the perturbation range for larger models and datasets. The proposed method thus reaches the same conclusion as on CIFAR-10 only within a small perturbation range on ImageNet. The main reason is that the complexity of the model architecture and the classification task further narrows the perturbation range over which the approximation in (3) holds, so the first-order analysis is not sufficiently accurate under all perturbations. In general, the influence of network depth and width on the accuracy of the first-order approximation determines the complexity of transferability analysis on different datasets, which is also a common limitation of existing research on robustness and even transferability on ImageNet through first-order analysis.

6 Conclusion

This paper takes the transferability of adversarial samples between sub-models as the starting point for studying ensemble robustness. Through first-order approximation analysis under the Lagrange conditions of optimization theory, this paper characterizes the model's adversarial distribution and output variation based on the singular value decomposition of the Jacobian matrix. Based on this theory, the level set of the gradient optimization problem exposes the shortcomings of previous transferability metrics. This paper effectively redefines the transferability metric between models by applying optimal transport theory to the singular matrices: given the singular vector corresponding to the maximum singular value of the source Jacobian matrix, the singular value corresponding to the target Jacobian singular vector that minimizes the Wasserstein distance reveals the approximate output variation. The sub-models obtained with this transferability metric as a regularization term achieve the best transfer isolation performance without prior information about adversarial samples. Such a definition, as a mathematical analysis of model attributes, has more general applicability in defense evaluation given the complex parameters and algorithms of adversarial sample generation. Further ensemble robustness experiments and theoretical analyses disentangle the correlation between robustness and transferability: the surrogate transfer attack under partial parameter leakage better reflects the robustness benefit of transferability isolation. In future research, there are several potential directions: (1) ensemble robustness based on transferability should consider richer ensemble strategies, so that transferability isolation can improve robustness; (2) beyond the Replicated SVD method for multi-way Jacobian data, the singular values and vectors obtained with HOSVD form an important research direction for studying more properties of adversarial samples; (3) on large datasets such as ImageNet, effective transferability metrics based on higher-order analysis with greater capacity should be further discussed and studied.