1 Introduction

The increasing deployment of deep learning-based AI systems in real-world scenarios has raised serious concerns about their security [8, 76, 88]. Adversarial examples, specifically crafted to evade an AI system's security measures, pose a significant threat to the trustworthiness and reliability of these systems [12, 31, 45, 51, 87, 95, 101]. Numerous defense mechanisms have been proposed to mitigate the potential impact of adversarial examples on AI systems [10, 59, 71, 96]. However, the constant emergence of new attack methods has rendered previous defenses ineffective. For instance, the Carlini & Wagner (C&W) attack [12] and backward pass differentiable approximation (BPDA) attack [6] have successfully compromised defense mechanisms that were previously considered robust. This has resulted in an ongoing arms race between attack and defense.

In the face of the complexities of practical application scenarios and the threat of malicious attackers, adversarial examples may not be limited to small \(L_{p}\)-norm adversarial perturbations [94]. Attackers can modify images on a large scale while maintaining semantic similarity [36, 43, 73, 81], or generate adversarial examples through spatial transformations such as rotation or translation [24, 94]. Furthermore, unrestricted adversarial attacks have become a new research focus in recent years [9, 85]. These developments pose significant challenges to existing defense methods.

Adversarial example detection aims to identify adversarial perturbations in image inputs, providing deep neural network (DNN) models with enhanced security without compromising their ability to recognize normal examples. Various techniques exist for detecting adversarial examples, including using distinct behavioral features [26, 34] or leveraging observed behavioral differences in a DNN model’s middle layer [25, 56]. Another approach assumes unique statistical properties between adversarial and normal examples to train a classifier [32, 59]. However, these methods demand a large number of adversarial examples, resulting in high time and sample complexity, and generally only detect the same type of adversarial examples. Although some studies have explored unknown adversarial example detection [27, 49, 61, 92], challenges persist in detecting emerging attacks such as semantic attack (SA), spatial transformation attack (STA), AutoAttack (AA), and composite adversarial attack (CAA). As these attacks rapidly evolve, adversarial example detection faces limitations, necessitating the ability to identify unknown and new types of adversarial examples.

Meta-learning adversarial detection (MetaAdvDet) is a representative method for detecting unknown adversarial examples using meta-learning techniques [57]. MetaAdvDet uses the model-agnostic meta-learning (MAML) [28] approach to study adversarial example detection from a fast-adaptation perspective. However, according to [74], feature reuse is the main factor behind MAML's success in few-shot learning. Moreover, as argued in [41], model vulnerability to adversarial examples stems from the presence of non-robust yet highly predictive features, which models trained with standard methods readily exploit. Inspired by this view, we aim to extract such features from adversarial examples and distinguish them from the features of normal examples in order to detect adversarial examples. To achieve this, we use a prototypical network [83] to learn feature representations of adversarial and normal examples that cluster around their respective class prototypes in the feature space. Each class prototype is obtained by averaging a small number of labeled examples in the feature space. Since an unknown attack may generate adversarial examples whose features differ significantly from those seen during training, we further process the features extracted by the backbone network in the meta-testing stage and update the class centers using the maximum a posteriori (MAP) algorithm [39].

Fig. 1

Overall framework of the proposed PAD approach. The approach consists of two distinct stages: meta-training (upper half of the figure) and meta-testing (lower half of the figure), both utilizing the same feature extraction backbone network \(f_{\theta }\). Upper: The backbone network is trained on known adversarial example detection tasks in multi-task format. Lower: The learned backbone network \(f_{\theta }\) is employed to extract features of all samples on unknown adversarial example detection tasks. For each test task, the support set is used to compute the initial class center. After applying the CL2N transformation and MAP algorithm, the query set labels \(\hat{\textbf{Y}}\) can be computed. More details can be found in Algorithms 1 and 2

The overall framework of the method is shown in Fig. 1. Each task represents a small data collection process, including normal examples and one type of randomly selected adversarial examples, simulating new attack scenarios [57]. The meta-training stage trains the feature extraction backbone network on multiple known adversarial example types, while the meta-testing stage tests the detection method’s performance on unknown adversarial example types. The method, referred to as PAD (ProtoNet Adversarial Detection), uses a prototypical network to train an end-to-end feature extraction network, processes extracted features of unknown adversarial examples, and predicts using the optimal mapping matrix. It can detect unknown adversarial examples under few-shot conditions. Specifically, our contributions are as follows:

  1. Due to the high time complexity of the bilevel optimization employed by the MAML method during training [5], it is customary to utilize a shallow backbone network, since a deeper network incurs a substantial training time overhead. In our PAD method, we instead propose using residual networks for feature extraction from both normal and adversarial examples. The network design incorporates a 7 \(\times \) 7 convolutional kernel while removing the average pooling in the initial layer. We then calculate the Euclidean distance between unknown adversarial examples and the class prototypes to discriminate between adversarial and normal examples. This approach greatly improves the method's performance for detecting unknown adversarial examples.

  2. We propose applying feature transformation to the features extracted from the backbone network, in conjunction with an iterative algorithm based on optimal transport theory for updating class centers, which further enhances the detection performance of unknown adversarial examples.

  3. Our proposed method exhibits a significant improvement in detecting unknown adversarial examples compared to existing few-shot learning algorithms on the cross-adversary benchmark for MNIST and CIFAR-10 datasets under both 1-shot and 5-shot settings.

  4. We extend the proposed method to the ImageNet dataset and conduct experiments to evaluate its performance in detecting two new adversarial attacks, AA and CAA. The results of the experiments clearly demonstrate the superiority of the method.

2 Related Work

2.1 Adversarial Attack

An adversarial attack involves introducing specific perturbations to create adversarial examples, designed to prompt deep neural networks to produce erroneous predictions without affecting human judgment. Adversarial attacks are classified into \(L_{p}\) and non-\(L_{p}\) attacks, based on their algorithms [53, 89]. Most current attack algorithms utilize \(L_{p}\)-norm perturbation imperceptibility metrics to generate adversarial examples that lead to incorrect decision-making in the target model [1, 33].

The \(L_{p}\)-norm adversarial example is denoted by \(\textbf{x}_{adv}=\textbf{x}+\varvec{\delta }\), wherein \(\varvec{\delta }\) is derived by solving Eq. 1:

$$\begin{aligned} \max _{\varvec{\delta } \in \Delta } \mathcal {L}\left( f_\theta (\textbf{x}+\varvec{\delta }), y_{\text {true}}\right) , \quad \text {s.t.}\ \Delta =\left\{ \varvec{\delta }:\Vert \varvec{\delta }\Vert _p<\varepsilon \right\} \end{aligned}$$
(1)

\(\mathcal {L}\) denotes the model’s loss function, typically characterized as cross-entropy loss, with p taking values 0, 1, 2, and \(\infty \), each corresponding to distinct \(L_{p}\)-norm adversarial examples [80]. \(L_0\) measures the number of pixels that can be perturbed; \(L_2\) measures the Euclidean distance between \(\textbf{x}\) and \(\textbf{x}_{adv}\); \(L_{\infty }\) measures the maximum alterable distance across all pixels. Perturbations based on the \(L_{p}\)-norm constraint are inadequate for measuring perceptual similarity [80]. Adversarial examples can be generated by altering color [36, 78], texture [7], spatial location [24, 94], and other factors, thus inducing model misclassification while preserving semantic or structural information. In the absence of \(L_{p}\)-norm restrictions, an attacker may introduce extensive and conspicuous modifications to an image, prompting model misclassification without affecting normal human perception. Such examples are referred to as unrestricted adversarial examples [1, 9, 85]. Bhattad et al. [7] demonstrate that adversarial training methods based on \(L_{p}\)-norm examples are not resilient to these adversarial examples.
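As a concrete illustration of Eq. 1, the sketch below shows a minimal \(L_{\infty }\) projected gradient descent (PGD) loop in PyTorch; the step size, iteration count, and pixel range are illustrative assumptions rather than the settings of any particular attack cited here.

```python
import torch
import torch.nn.functional as F

def pgd_linf(f_theta, x, y_true, eps=8 / 255, alpha=2 / 255, steps=10):
    """Approximately solve Eq. 1 for p = infinity by iterative gradient ascent on the loss."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(f_theta(x + delta), y_true)
        loss.backward()
        # Ascend the loss, then project back into the L_inf ball of radius eps.
        delta.data = (delta + alpha * delta.grad.sign()).clamp(-eps, eps)
        delta.data = (x + delta.data).clamp(0, 1) - x  # keep pixels in [0, 1]
        delta.grad.zero_()
    return (x + delta).detach()
```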

Vulnerability assessment evaluates the target model’s susceptibility to performance degradation when confronted with sophisticated perturbations, emphasizing metrics like attack success rate or robust accuracy. AA and CAA represent two novel adversarial attack methodologies. AA is an \(L_{p}\) attack, while CAA additionally encompasses non-\(L_{p}\) attacks. AA combines improved PGD methods with FAB [18] and Square Attack [4], forming a collection of attacks commonly used for evaluating robustness. CAA defines a sequential step, creating a powerful strategy against defense models by automatically searching for optimal attack combinations among 32 algorithms and parameters.

2.2 Adversarial Defense

Adversarial defense methods fall into two categories. The first aims to enhance model robustness, aligning the model’s perception with human perception and ensuring accurate prediction of adversarial examples. Adversarial training is a key method for improving robustness but faces challenges from robustness trade-offs [99] and gradient masking issues [89]. An increase in adversarial robustness may reduce accuracy for normal examples [90], and \(L_{\infty }\)-norm-based adversarial training models may lack robustness against \(L_1\)-norm or \(L_2\)-norm adversarial examples [53, 85, 89].

Another defense strategy is adversarial example detection, which identifies adversarial perturbations in input samples during the model’s prediction phase. If detected, the sample is rejected; otherwise, it is processed by the target model. The detection of adversarial examples serves as a critical defense mechanism against adversarial attacks, adding an extra layer of security for practical deployment of DNN models. The detection task needs to address the challenge of identifying potentially malicious samples, given that the model itself is acknowledged to be vulnerable. The focus lies on the detector's ability to accurately distinguish between adversarial and normal samples. Detection methods are classified as supervised or unsupervised based on their use of adversarial example information [2]. Supervised detection methods include network invariant [13, 25, 56, 64, 72], auxiliary model [44, 82, 104] and statistical methods [17, 26, 32, 49, 50, 59]. Unsupervised methods are further divided into network invariant [58], auxiliary model [3, 68, 86], statistical [34, 35, 102], object-based [29], denoiser [63, 84], and feature-squeezing methods [52, 97]. Additionally, few-shot learning detection methods, such as the meta-learning approach employed by [57], named MetaAdvDet, address the challenge of detecting new adversarial attacks with limited examples by modeling detection as a few-shot learning problem and utilizing a double-network framework for rapid adaptation to new attacks.

The MetaAdvDet method innovatively utilizes a meta-learning approach to detect adversarial examples. However, it primarily relies on the fast adaptation ability of the MAML algorithm itself, which remains deficient in explaining the nature of adversarial examples, thus limiting the room for further improvement of its method. Our method utilizes a metric learning-based approach to conduct research around feature extraction of adversarial and normal examples, which addresses some of the shortcomings in [57] and improves the generalization for detecting unknown adversarial examples. The motivation and experimental focus of our proposed method are quite different from [57], mainly in the following three aspects.

Firstly, while MetaAdvDet tackles the problem of detecting unknown adversarial examples by emphasizing the algorithm’s ability to adapt quickly, our approach builds on the work of [41], which examines adversarial examples from a feature-based perspective. As a result, our choice of algorithms fundamentally differs from that of MetaAdvDet.

Secondly, although MetaAdvDet proposes a framework for detecting unknown adversarial examples, it provides limited results in terms of detecting such examples. In contrast, our method not only reports the detection results of the two missing types of adversarial examples identified in [57], but also extends these results to a subset of the ImageNet dataset. This significantly surpasses the capabilities of existing methods.

Thirdly, the cross-domain and cross-architecture test results presented in [57] are not directly relevant to the detection of unknown adversarial examples. In contrast, our proposed method places a greater emphasis on detecting unknown adversarial examples. The design of the backbone network and the selection of modules are based on the feature extraction of adversarial examples as the starting point, enabling us to provide more comprehensive detection results for unknown adversarial examples.

3 Approach

3.1 Overview

Figure 1 illustrates the proposed method’s framework, including meta-training and meta-testing stages. Each task consists of a support set and a query set. In the meta-training stage, the support set is used to calculate the class prototype, and adversarial examples are identified based on the distance between the prototype and the query set features. The meta-testing stage involves the backbone network extracting features from unknown adversarial examples in each task’s support and query sets. These features undergo Center and \(L_2\)-normalize (CL2N) transformation [93] and the MAP algorithm (lines 6–11 in Algorithm 2) [39] to update the class center. After iterative steps, probabilities for normal and adversarial examples in the query set are obtained. The meta-testing stage’s support set contains 1-shot or 5-shot adversarial examples, enabling the framework to detect new adversarial attacks with few-shot examples.

3.2 Backbone Network

In few-shot learning, a deeper backbone network can reduce intra-class variation [16]. Influenced by the non-local global self-attention mechanism of transformers, recent convolutional network architectures have embraced the concept of using large convolutional kernels [22, 55]. The motivation behind this choice, as discussed in [22], lies in the increased effective receptive fields (ERFs) and the introduction of more shape bias. Our ablation experiments also indicate that smaller kernel sizes are less effective at identifying adversarial examples. Thus, our prototypical network employs ResNet-10, a simplified version of ResNet-18 [16], with a 7 \(\times \) 7 kernel size in the first convolutional layer, as opposed to MetaAdvDet’s 3-layer convolutional network. Adversarial perturbations with small \(L_{p}\)-norm resemble hidden image perturbations in steganalysis [14]. Pooling operations in residual networks may hinder the extraction of noise-like perturbation features, leading to their removal after the first convolutional layer. The designed backbone network structure is illustrated in Fig. 2.

Fig. 2

Feature extraction backbone network architecture
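To make the design above concrete, the following PyTorch sketch outlines a ResNet-10-style feature extractor with a 7 \(\times \) 7 stem convolution and no pooling directly after it; the channel widths and stage layout are assumptions for illustration, not the authors' exact architecture.

```python
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.down = None
        if stride != 1 or in_ch != out_ch:
            self.down = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
                nn.BatchNorm2d(out_ch))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x if self.down is None else self.down(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)

class ResNet10Backbone(nn.Module):
    """ResNet-10-style feature extractor f_theta: one basic block per stage,
    a 7x7 stem convolution, and no pooling layer directly after the stem."""
    def __init__(self, in_ch=3, feat_dim=512):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, 64, kernel_size=7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True))  # no pooling here
        self.stages = nn.Sequential(
            BasicBlock(64, 64),
            BasicBlock(64, 128, stride=2),
            BasicBlock(128, 256, stride=2),
            BasicBlock(256, feat_dim, stride=2))
        self.pool = nn.AdaptiveAvgPool2d(1)  # global pooling before the feature vector

    def forward(self, x):
        h = self.stages(self.stem(x))
        # Features are non-negative because the last operation before pooling is a ReLU.
        return self.pool(h).flatten(1)
```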

3.3 Task Construction

An adversarial example dataset is created by applying adversarial attack methods to a selected dataset and dividing it into training and test sets. Meta-training tasks are derived from the training set, while meta-testing tasks come from the test set. Each task, containing one adversarial example type, is split into support and query sets with normal and adversarial examples. Support set sample number depends on a specified shot, such as 1-shot or 5-shot. To simulate unknown attack detection scenarios, meta-testing tasks require distinct adversarial types, maintaining the same support set sample number as in the meta-training stage, and using a query set to evaluate the detector’s performance.
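As an illustration of this sampling procedure, a minimal sketch is given below; it assumes the normal and adversarial examples are already available as NumPy arrays, and the hypothetical helper sample_task builds one 2-way s-shot detection task.

```python
import numpy as np

def sample_task(normal_pool, adv_pools, shot=5, query=15, rng=np.random):
    """Build one detection task: normal examples plus one randomly chosen adversarial
    type, split into support and query sets (labels: 0 = normal, 1 = adversarial)."""
    adv_pool = adv_pools[rng.randint(len(adv_pools))]  # one attack type per task
    normal = normal_pool[rng.choice(len(normal_pool), shot + query, replace=False)]
    adv = adv_pool[rng.choice(len(adv_pool), shot + query, replace=False)]

    support_x = np.concatenate([normal[:shot], adv[:shot]])
    support_y = np.array([0] * shot + [1] * shot)
    query_x = np.concatenate([normal[shot:], adv[shot:]])
    query_y = np.array([0] * query + [1] * query)
    return (support_x, support_y), (query_x, query_y)
```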

3.4 Meta-training Stage

The original dataset is denoted as \(\mathcal {D}=\mathcal {D}_{\text {train}} \cup \mathcal {D}_{\text {test}}\), and the set of adversarial attack algorithms is represented by \(\mathbb {A}=\left\{ \mathcal {A}_1, \mathcal {A}_2, \ldots , \mathcal {A}_n\right\} \), where n is the total number of attack operations. The adversarial example datasets for the meta-training and meta-testing stages are given by \(\mathcal {D}_{\text {meta\_train}}=\mathcal {D}_{\text {adv}}\cup \mathcal {D}_{\text {train}}=\mathcal {A}\left( \mathcal {D}_{\text{ train }}\right) \cup \mathcal {D}_{\text {train}}\) and \(\mathcal {D}_{\text {meta\_test}}=\mathcal {D}_{\text {adv}}^{\prime } \cup \mathcal {D}_{\text{ test }}=\mathcal {A}^{\prime }\left( \mathcal {D}_{\text {test}}\right) \cup \mathcal {D}_{\text {test}}\), respectively, with the condition that \(\mathcal {A} \cap \mathcal {A}^{\prime }=\varnothing \), \(\mathcal {A},\mathcal {A} ^{\prime }\subseteq \mathbb {A}\).

The adversarial example detection task is a 2-way s-shot learning task, as category labels are divided into normal and adversarial examples. Meta-training tasks \(\mathcal {D} _{\textrm{task}}\) are formed by random sampling from \(\mathcal {D}_{\text {meta\_train}}\). Each task \(\mathbb {T} _t\) includes a support set \(S^t=\left\{ \left( \textbf{x}_1, y_1\right) , \ldots ,\left( \textbf{x}_{2\,s}, y_{2\,s}\right) \right\} \) and a query set \(Q^t=\left\{ \left( \textbf{x}_1, y_1\right) , \ldots ,\left( \textbf{x}_{2q}, y_{2q}\right) \right\} \). The variable y represents category labels, and \(S_{j}^{t}\) denotes the set of samples with label j in the support set, which contains s samples per class, while the query set has q samples per class.

Denote the backbone network as \(f_{\theta }\), extracting feature \(\textbf{f}\) from input sample \(\textbf{x}\) as \(\textbf{f}=f_{\theta }\left( \textbf{x}\right) ,\textbf{f}\in \mathbb {R}^D\) with learnable parameter \(\theta \). The extracted feature is obtained after applying the ReLU function in the backbone network, ensuring non-negative feature components. Denote \(\textbf{o}_{j,j\in \{0,1\}}\) as the class prototype of task \(\mathbb {T}_t\), with \(\textbf{o}_j\in \mathbb {R}^D\). Each class prototype is computed by averaging the support set sample features as follows:

$$\begin{aligned} \textbf{o}_j=\frac{1}{s} \sum _{\left( \textbf{x}, y\right) \in S_j^t} f_\theta (\textbf{x}) \times \mathbb {1}\{y=j\} \end{aligned}$$
(2)

Given a distance function \(d:\mathbb {R}^D \times \mathbb {R}^D \rightarrow [0, +\infty )\), a distribution over classes for a query point \(\textbf{x}\) is obtained by applying a softmax over distances between the features and class prototypes as follows:

$$\begin{aligned} p_\theta (y=j \mid \textbf{x})=\frac{\exp \left( -d\left( f_\theta (\textbf{x}), \textbf{o}_j\right) \right) }{\sum _{j^{\prime }} \exp \left( -d\left( f_\theta (\textbf{x}), \textbf{o}_{j^{\prime }}\right) \right) } \end{aligned}$$
(3)

Learning proceeds by minimizing the negative log-probability \(J(\theta )=-\log p_\theta (y=j \mid \textbf{x})\) of the true class label. The metric employed for measuring distance is the squared Euclidean distance. By adopting a negative value for this metric, the distance is transformed into a measure of similarity. This means that the smaller the distance, the higher the similarity score between the sample and the prototype. The meta-training algorithm is illustrated in Algorithm 1. The similarity scores corresponding to \(j=0\) and \(j=1\) are represented as “scores”. The cross-entropy function is then calculated between these scores and the true label.

Algorithm 1

PAD meta-training procedure
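As a complement to Algorithm 1, the following sketch shows one meta-training episode under the assumption of a PyTorch backbone: class prototypes are averaged from the support set (Eq. 2), query samples are scored by the negative squared Euclidean distance to the prototypes (Eq. 3), and the cross-entropy of the true labels is minimized.

```python
import torch
import torch.nn.functional as F

def episode_loss(f_theta, support_x, support_y, query_x, query_y):
    """One prototypical-network episode; labels are LongTensors with 0 = normal, 1 = adversarial."""
    z_s = f_theta(support_x)                       # support features
    z_q = f_theta(query_x)                         # query features
    protos = torch.stack([z_s[support_y == j].mean(0) for j in (0, 1)])  # Eq. 2
    scores = -torch.cdist(z_q, protos) ** 2        # negative squared Euclidean distances
    return F.cross_entropy(scores, query_y)        # minimizes -log p_theta(y | x), Eq. 3

# Usage per sampled task:
#   loss = episode_loss(backbone, sx, sy, qx, qy); optimizer.zero_grad(); loss.backward(); optimizer.step()
```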

3.5 Meta-testing Stage

Unknown adversarial example detection tasks \(\mathcal {D} _{\textrm{task}}^{\prime }\) are randomly sampled from \(\mathcal {D} _{\textrm{meta}\_\textrm{test}}\). The backbone network \(f_{\theta }\), obtained in the meta-training stage, is employed to extract features from the support and query sets. The support set contains labeled samples with extracted features denoted as \(\textbf{F}_S\), while the query set contains unlabeled samples with extracted features denoted as \(\textbf{F}_Q\). For each \(\textbf{f}\in \textbf{F}_S\cup \textbf{F}_Q\), \(l(\textbf{f})\) represents the label of the corresponding sample, while \(\textbf{o}_{j,j\in \{0,1\}}\) signifies the class center corresponding to class j. Given that feature transformations have a positive impact on few-shot learning tasks [93], the feature vector \(\textbf{f}\) undergoes a CL2N feature transformation as per Eq. 4, resulting in \(\hat{\textbf{f}}\).

$$\begin{aligned} \hat{\textbf{f}}=\frac{\textbf{f}-\bar{\textbf{f}}}{\left\| \textbf{f}-\bar{\textbf{f}}\right\| _2} \end{aligned}$$
(4)

where \(\bar{\textbf{f}}\) denotes the mean of the extracted feature vectors.
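A minimal sketch of the CL2N step, under the assumption that centering uses the mean of all task features (support and query); other instantiations of the transform in [93] center on a different mean.

```python
import torch

def cl2n(feats_support, feats_query):
    """Center all task features on their mean, then L2-normalize each feature vector."""
    all_feats = torch.cat([feats_support, feats_query], dim=0)
    mean = all_feats.mean(dim=0, keepdim=True)

    def transform(f):
        f = f - mean
        return f / f.norm(dim=1, keepdim=True).clamp_min(1e-12)

    return transform(feats_support), transform(feats_query)
```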

The prototype network approach assumes the existence of an embedding space where samples cluster around a single class center, i.e., the prototype. Therefore, clustering can be used to classify the query set. Let \(\textbf{P}=\left( p_{ij}\right) _{2q\times 2}\) represent the probability matrix, with each element \(p_{ij}\) signifying the likelihood that an unlabeled sample i is assigned to class center \(\textbf{o}_{j}\). Define \(\textbf{M}=\left( m_{ij}\right) _{2q\times 2}\) as the cost matrix, where \(m_{ij}=\left\| \textbf{f}_i-\textbf{o}_j \right\| _2^2\) denotes the squared Euclidean distance between sample i and class center \(\textbf{o}_{j}\). The optimization objective of the clustering problem is formulated as:

$$\begin{aligned} \underset{p_{ij},\textbf{o}_j}{\min } \sum _{i=1}^{2q}{\sum _{j=0}^1{p_{ij}m_{ij}}} \end{aligned}$$
(5)

Different approaches to solving Eq. 5 can be derived by applying various assumptions to the probabilities \(p_{ij}\) and class center \(\textbf{o}_{j}\) in the above optimization objective. Three such methods will be introduced.

(1) Nearest Class Mean (NCM) method. The original prototype network classifies query set samples with the NCM method, where distances between query set samples and the prototype are calculated to predict class membership. In Eq. 5, the class center \(\textbf{o}_{j}\) remains constant, computed solely from the support set samples. Given the distance function \(d: \mathbb {R}^D\times \mathbb {R}^D \rightarrow \mathbb {R}^+\), the optimal solution to Eq. 5, taking into account the normalization of probabilities, results in:

$$\begin{aligned} \begin{aligned} p_{ij^*}=\frac{1}{2q},j^*=\textrm{arg} \min _jd(\textbf{f}_i,\textbf{o}_j ) \end{aligned} \end{aligned}$$
(6)

However, few-shot learning’s inherent sample scarcity leads to biases in class center computation, as only a few samples are used. Reference [54] identifies two bias types affecting class center estimation and theorizes that using a larger sample set can improve the lower bound of the expected performance. Consequently, incorporating more unlabeled samples within the unsupervised clustering framework can diminish bias in class center estimation.

(2) K-means method. If Eq. 5 satisfies the condition specified in Eq. 7, then it represents the optimization objective of the K-means algorithm, an unsupervised clustering method. Each query sample carries equal probability mass, inversely proportional to the total number of query samples, i.e., \(\sum _{j=0}^1{p_{ij}}=\frac{1}{2q}\). Notably, since \(p_{ij}\in \left\{ 0,{1}/{2q} \right\} \), each sample is assigned to only one class center, reflecting a hard assignment approach. In optimizing the objective of Eq. 5, the support set is employed to initialize the class centers, which the K-means algorithm then iteratively refines. After the class centers are updated, Eq. 6 is applied for prediction.

$$\begin{aligned} \sum _{j=0}^1{p_{ij}}=\frac{1}{2q},\quad p_{ij}\in \left\{ 0,\frac{1}{2q} \right\} ,\quad \forall i\in \left\{ 1,\ldots ,2q \right\} ,\ j\in \left\{ 0,1 \right\} \end{aligned}$$
(7)
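For comparison with the MAP variant introduced next, a sketch of this K-means baseline is given below: class centers are initialized from the support set, refined by hard assignment, and query samples are then predicted via Eq. 6. The iteration count is an illustrative assumption, and detached (inference-time) feature tensors are assumed.

```python
import torch

def kmeans_centers(feats_support, labels_support, feats_query, n_iter=20):
    """Hard-assignment refinement of the two class centers (Eq. 7), then Eq. 6 prediction."""
    feats_support, feats_query = feats_support.detach(), feats_query.detach()
    centers = torch.stack([feats_support[labels_support == j].mean(0) for j in (0, 1)])
    for _ in range(n_iter):
        assign = torch.cdist(feats_query, centers).argmin(dim=1)  # hard assignment
        for j in (0, 1):
            members = torch.cat([feats_support[labels_support == j],
                                 feats_query[assign == j]])
            centers[j] = members.mean(0)                          # re-estimate center j
    return centers, torch.cdist(feats_query, centers).argmin(dim=1)  # query predictions
```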

(3) MAP method. The hard assignment approach employed by the K-means method renders the inference process non-differentiable [75]. Furthermore, adversarial examples derived from normal samples show varying distances from the original class after neural network transformations, depending on the type of perturbation. This variance necessitates a soft assignment method, shifting the constraint from \(p_{ij}\in \left\{ 0,{1}/{(2q)} \right\} \) to \( p_{ij}\geqslant 0 \). This adjustment ensures all samples contribute to the current class center calculation, reducing detection errors in scenarios with high intra-class variance and low inter-class variance.

Our approach to detecting adversarial examples aligns with the typical few-shot learning setup, where all the classes within the query sets are equally likely. This is known as the uniform prior on the class distribution, which represents an equal probability scenario of encountering either an attack or normal environment. The equation \(\sum _{i=1}^{2q}{p_{ij}=\frac{1}{2}}\) symbolizes this uniform class prior, indicating equal likelihood of both adversarial and normal classes appearing in the query set. Defining \(\textbf{r} \) and \(\textbf{c} \) as \(\textbf{r}=[\frac{1}{2q},...,\frac{1}{2q}]^{\textsf {T}}\) and \(\textbf{c}=[\frac{1}{2},...,\frac{1}{2}]^{\textsf {T}}\), and coupled with the soft allocation condition \( p_{ij}\geqslant 0 \), we reformulate the optimization objective as follows:

$$\begin{aligned} \begin{aligned}&d_\textbf{M}(\textbf{r},\textbf{c})=\mathop {\min }_{\textbf{P}\in \mathbb {U} \left( \textbf{r},\textbf{c} \right) }\left\langle \textbf{P},\textbf{M}\right\rangle =\mathop {\min }_{\textbf{P}\in \mathbb {U} \left( \textbf{r},\textbf{c} \right) }\sum \nolimits _{ij}{p_{ij}m_{ij}}, \quad \text {where} \\&\mathbb {U}(\textbf{r}, \textbf{c})=\left\{ \textbf{P} \in \mathbb {R}_{+}^{2 q \times 2} \mid \textbf{P} \textbf{1}_{2}=\textbf{r}, \textbf{P}^{\textsf {T}} \textbf{1}_{2 q}=\textbf{c}\right\} \end{aligned} \end{aligned}$$
(8)

The symbol \(\textbf{1}_d\) denotes a d-dimensional vector of ones. Eq. 8 quantifies the cost of transitioning from the marginal distribution \(\textbf{r}\) to \(\textbf{c}\) via a transport matrix \(\textbf{P}\). Here, \(\mathbb {U} \left( \textbf{r},\textbf{c} \right) \) denotes all possible transport schemes. This process results in the distance \(d_\textbf{M}(\textbf{r},\textbf{c})\), leading to the optimal transport scheme \(\textbf{P}^*\) as per optimal transport theory. The Sinkhorn iterative algorithm, noted for its efficiency, is particularly effective in solving this problem. The optimal transport matrix \(\textbf{P}^*\) can be obtained as follows:

$$\begin{aligned} \begin{aligned} \textbf{P}^*&=\textrm{Sinkhorn}\left( \textbf{M},\textbf{r},\textbf{c}, \lambda \right) \\&=\mathop {\mathrm {arg\,min}}_{\textbf{P}\in \mathbb {U} \left( \textbf{r},\textbf{c} \right) }\sum \nolimits _{ij}{p_{ij}m_{ij}-\frac{1}{\lambda }h\left( \textbf{P} \right) } \end{aligned} \end{aligned}$$
(9)

The entropy of \(\textbf{P}\), denoted as \(h\left( \textbf{P}\right) \), can be expressed as:

$$\begin{aligned} h\left( \textbf{P}\right) =-\sum _{ij}{p_{ij}\log p_{ij}} \end{aligned}$$
(10)

As \(\lambda \) increases, the influence of information entropy diminishes, and the cost matrix exerts a stronger impact on the final result. Based on Theorem 2 of [20], the solution of Eq. 9 is unique and assumes the following form:

$$\begin{aligned} \textbf{P}^*=\textrm{diag}\left( \textbf{u} \right) \exp \left( -\lambda \textbf{M} \right) \textrm{diag}\left( \textbf{v} \right) \end{aligned}$$
(11)

The vectors \(\textbf{u}\) and \(\textbf{v}\) can be determined using Sinkhorn’s fixed point iteration algorithm, which can then be substituted into Eq. 11 to obtain \(\textbf{P}^*\). In calculating the cost matrix \(\textbf{M}\) in Eq. 11, the class centers are first initialized using the samples from the support set. After obtaining \(\textbf{P}^*\), the class centers are re-estimated according to Eq. 12 [39] using the MAP method.

$$\begin{aligned} \textbf{o}_j\leftarrow \textbf{o}_j+\alpha \left( \varvec{\mu }_j-\textbf{o}_j \right) \end{aligned}$$
(12)

where \(\varvec{\mu }_j\) is calculated by Eq. 13.

$$\begin{aligned} \varvec{\mu }_j=\frac{\sum _{i=1}^{2 q} p_{i j}^* \textbf{f}_i+\sum _{\textbf{f} \in \textbf{F}_S, l(\textbf{f})=j} \textbf{f}}{s+\sum _{i=1}^{2 q} p_{i j}^*} \end{aligned}$$
(13)

After N iterations, the rows of \(\textbf{P}^*\) indicate class probabilities, with the maximum value determining the class label assigned to unlabeled samples, denoted as \(\hat{\textbf{Y}}\).
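The meta-testing computation described above can be sketched as follows, assuming detached (inference-time) feature tensors: Sinkhorn's matrix scaling yields \(\textbf{P}^*\) under the uniform marginals \(\textbf{r}\) and \(\textbf{c}\) (Eqs. 8–11), and the class centers are then moved toward the soft means with step size \(\alpha \) (Eqs. 12 and 13). The values of \(\lambda \), \(\alpha \), and N follow Sect. 4.1.4, while the number of inner scaling iterations is an illustrative assumption.

```python
import torch

def sinkhorn(M, r, c, lam=0.1, n_scaling=50):
    """Entropy-regularized optimal transport (Eq. 9): returns P* with row sums r and column sums c."""
    K = torch.exp(-lam * M)          # kernel from the cost matrix, as in Eq. 11
    u = torch.ones_like(r)
    for _ in range(n_scaling):
        v = c / (K.t() @ u)          # enforce column marginals (class prior c)
        u = r / (K @ v)              # enforce row marginals (uniform over query samples)
    return u.unsqueeze(1) * K * v.unsqueeze(0)

def map_inference(feats_support, labels_support, feats_query, alpha=0.2, n_iter=20, lam=0.1):
    """MAP meta-testing loop: Sinkhorn assignment (Eqs. 8-11) plus center update (Eqs. 12-13)."""
    feats_support, feats_query = feats_support.detach(), feats_query.detach()
    s = (labels_support == 0).sum().item()          # shots per class
    q2 = feats_query.shape[0]                       # 2q query samples
    r = torch.full((q2,), 1.0 / q2)                 # uniform row marginal r
    c = torch.full((2,), 0.5)                       # uniform class prior c
    centers = torch.stack([feats_support[labels_support == j].mean(0) for j in (0, 1)])
    for _ in range(n_iter):
        M = torch.cdist(feats_query, centers) ** 2  # cost matrix of squared distances
        P = sinkhorn(M, r, c, lam)                  # optimal transport plan P*
        for j in (0, 1):
            # Eq. 13: soft mean over query samples plus the labeled support samples of class j
            mu_j = (P[:, j].unsqueeze(1) * feats_query).sum(0) \
                   + feats_support[labels_support == j].sum(0)
            mu_j = mu_j / (s + P[:, j].sum())
            centers[j] = centers[j] + alpha * (mu_j - centers[j])  # Eq. 12
    # Final assignment with the updated centers; row-wise maxima give the predicted labels.
    P = sinkhorn(torch.cdist(feats_query, centers) ** 2, r, c, lam)
    return P.argmax(dim=1)
```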

Transductive learning typically outperforms inductive learning in few-shot learning tasks, as evidenced by several leading methods [39, 40, 77, 100] that leverage the Sinkhorn algorithm, a transductive technique. Eq. 8 imposes prior constraints on the number of samples in each class. In comparison, the NCM and K-means classifiers do not adequately integrate the prior knowledge and constraints needed for adversarial example detection. Therefore, the MAP algorithm, built on the Sinkhorn algorithm and enriched with this prior knowledge, stands out as a more effective option for detecting adversarial examples during the meta-testing stage (refer to Sect. 4.5 for ablation experiments with different class-center calculation methods). Additionally, the entropy regularization term in the optimization objective facilitates faster convergence.

For the query set of \(K_{\textrm{test}}\) tasks, labels are predicted and evaluation metrics are computed. Subsequently, the evaluation metrics for the entire task are averaged to obtain the final evaluation result of the detector. The meta-testing algorithm is shown in Algorithm 2.

Algorithm 2

PAD meta-testing procedure

3.6 Threat Model

In the weakest defense setting, the defender is aware of the attack and can use the generated adversarial examples for training, i.e., the same type of adversarial examples are employed for both training and testing. This is referred to as attack-aware black-box detection [61]. Traditional supervised adversarial example detection algorithms are evaluated under this setting. In real attack scenarios, it is challenging for the defender to know the adversary’s strategy. However, a limited number of adversarial examples can be labeled based on the system’s actual response or manual means. Consequently, this paper assesses the detector’s capability under this setting.

We consider two different threat models according to the adversary’s knowledge of the defender [11, 14, 68]: oblivious adversaries and adaptive white-box adversaries.

Under the oblivious attack, adversaries are unaware of the detector and generate adversarial examples using an unprotected target model. This paper evaluates the detector’s performance in two cases: the cross-adversary benchmark and new adversarial attack scenarios. The benchmark employs the setting of [57] and includes less common attack types, such as EAD [15], SA [37], and STA [24], simulating unknown adversarial scenarios. Additionally, two new adversarial attacks, AA [19] and CAA [62], are incorporated to assess the detector’s detection capability. This paper adopts the searched strategies of [62], CAA-\({L_\infty }\), CAA-\(L_2\), and CAA-unrestricted, as shown in Table 1.

Table 1 Types of adversaries under the oblivious attack

Under the adaptive white-box attack, adversaries possess full knowledge of the original classification model’s parameters, training strategy, and the detector. Consequently, the adversary can target both the original classification model and the detector, leading to misclassification and detector evasion. Building upon the concept from [11], we combine the original classification model and the detector into a single model. We then employ the C&W attack to generate white-box adversarial examples, with the combined model represented as follows:

$$\begin{aligned} G(\textbf{x})_i={\left\{ \begin{array}{ll} Z_C(\textbf{x})_i&{} \textrm{if} \ i\le N\\ 2\times Z_{D_K}(\textbf{x})\times \max _jZ_C(\textbf{x})_j&{} \textrm{if} \ i=N+1\\ \end{array}\right. } \end{aligned}$$
(14)

\(Z_C\left( \textbf{x} \right) _i\) represents the logits output of the classification layer in the original classification model. Diverging from [57], which only attacks the master network of the meta-learner in MetaAdvDet before fine-tuning, this paper employs a more potent attack setting, i.e., dynamically generating adversarial examples for the fine-tuned model on each task. \(D_K\) denotes the fine-tuned detection model for the Kth task, and \(Z_{D_K}\left( \textbf{x} \right) \) signifies the logits of input examples classified as adversarial by this model. If \(Z_{D_K}\left( \textbf{x} \right) \) exceeds 0.5, the output class of \(G\left( \textbf{x} \right) _i\) will be \(N+1\); otherwise, it will match the original classifier output.
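The combined model of Eq. 14 can be written as a single module that exposes N + 1 logits, so that an off-the-shelf C&W implementation can attack the classifier and detector jointly. The sketch below assumes the detector returns a scalar per-sample score that the input is adversarial; this interface is an assumption for illustration.

```python
import torch
import torch.nn as nn

class CombinedModel(nn.Module):
    """Eq. 14: append an (N+1)-th logit that dominates whenever the detector flags the input."""
    def __init__(self, classifier, detector):
        super().__init__()
        self.classifier = classifier   # Z_C: returns the N class logits
        self.detector = detector       # Z_{D_K}: returns a scalar "is adversarial" score per input

    def forward(self, x):
        z_c = self.classifier(x)                               # shape (B, N)
        z_d = self.detector(x).view(-1, 1)                     # shape (B, 1)
        extra = 2.0 * z_d * z_c.max(dim=1, keepdim=True).values
        return torch.cat([z_c, extra], dim=1)                  # shape (B, N+1)
```

A C&W attack on this combined model must then both change the classifier's decision and keep the detector score low enough that class N+1 does not dominate.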

4 Experiment

4.1 Datasets and Settings

4.1.1 Datasets

In the experiments, we utilized three widely-used datasets: MNIST [48], CIFAR-10 [46], and ImageNet [21]. Due to computational resource limitations and time constraints, we used a subset of ImageNet, the ImageNette dataset [38], which consists of 10 categories with 9469 training images and 3925 test images. The MNIST and CIFAR-10 images retain their default sizes, while ImageNet images are uniformly scaled to 224\(\times \)224\(\times \)3.

4.1.2 Target Models

For MNIST and CIFAR-10, we employed a 4-layer convolutional network (Conv-4) as the target model. It is trained for 100 epochs on the training set using the Adam optimizer with a learning rate of 0.001, resulting in test set accuracies of 98.87% and 82.83%, respectively. For the ImageNet dataset, we utilized the ResNet-50 network with the same training parameters as in the MNIST and CIFAR-10 datasets, achieving a test set accuracy of 80.36%. Target models are trained without data augmentation techniques, employing only label smoothing with a smoothing factor of 0.1.
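A minimal sketch of this target-model training setup, assuming a PyTorch classifier and data loader; the Conv-4 architecture itself is omitted, and only the optimizer, learning rate, and label-smoothing settings from the text are reflected.

```python
import torch
import torch.nn as nn

def train_target_model(model, train_loader, epochs=100, device="cuda"):
    """Standard training loop: Adam (lr = 0.001), label smoothing 0.1, no data augmentation."""
    model.to(device).train()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
    for _ in range(epochs):
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
    return model
```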

4.1.3 Attack Parameter Settings

We employed 15 attack methods from the CleverHans library [69] for the meta-training stage and the cross-adversary benchmark, using the default settings from [57]. The JSMA attack on ImageNet and AA are implemented using the ART toolbox [67], while CAA employed the algorithm provided by [62]. For AA on CIFAR-10 and ImageNet, \(\epsilon = 0.03\), while \(\epsilon = 0.05\) on MNIST, maintaining a consistent maximum of 100 iterations for both. Based on the attack strategies searched by [62], CAA-\({L_\infty }\) and CAA-\(L_2\) are conducted on CIFAR-10 and ImageNet, while CAA-unrestricted is conducted on ImageNet with the parameters detailed in Table 1. For the adaptive white-box attack, a confidence level of 0.3 and a maximum of 100 iterations are used.

4.1.4 Detector Parameter Settings

Our detector employs a prototypical network approach suitable for deeper residual networks, such as ResNet-10, while MetaAdvDet employs a 3-layer convolutional network (Conv-3), where a deeper residual network might degrade performance, as observed in our experiments. The adversarial example detection task applies a two-way setting for the detector to distinguish adversarial examples from normal ones, with one or five samples (s) per category in the support set and 35 and 15 samples (q) per category in the query set during the meta-training and meta-testing stages, respectively. The total number of tasks is set to \(K_{\textrm{train}}\) = 20,000 for meta-training and \(K_{\textrm{test}}=1000\) for meta-testing. The maximum number of iterations in the meta-training stage is set to 10. The prototypical network training utilizes the Adam optimizer, a default learning rate of 0.001, and a cross-entropy loss function. Following the settings in [39], the hyperparameters for the meta-testing stage are set to \(\lambda =0.1\), \(\alpha =0.2\), and \(N=20\), respectively.

4.1.5 Evaluation Indicators

In our experiments, we employ three evaluation metrics (detection accuracy, F1 score, and AUC) to comprehensively assess the detector’s performance, rather than relying on a single metric. Contrary to [57], which designates normal samples as positive (label 1 for normal, label 0 for adversarial), we define adversarial examples as the positive class when calculating AUC, while the remaining metrics keep the convention of [57]. The final detection accuracy, F1 score, and AUC are obtained by averaging the results across all tasks in the meta-testing stage.
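Per-task metrics can be computed as in the sketch below and then averaged over the \(K_{\textrm{test}}\) tasks; the scikit-learn-based helper and its inputs (per-sample predictions and the predicted probability of the adversarial class) are assumptions for illustration, following the label conventions stated above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def task_metrics(y_true, y_pred, p_adv):
    """y_true / y_pred: numpy arrays with 1 = adversarial, 0 = normal;
    p_adv: predicted probability that each query sample is adversarial."""
    acc = accuracy_score(y_true, y_pred)
    # F1 with normal examples as the positive class (convention of [57]).
    f1 = f1_score(1 - y_true, 1 - y_pred)
    # AUC with adversarial examples as the positive class.
    auc = roc_auc_score(y_true, p_adv)
    return acc, f1, auc

def evaluate(tasks):
    """Average the per-task metrics over all meta-testing tasks."""
    results = np.array([task_metrics(*t) for t in tasks])
    return results.mean(axis=0)   # mean detection accuracy, F1 score, and AUC
```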

4.1.6 Methods of Comparison

To exhibit the efficacy of our proposed approach, a fair comparison is essential using the adversarial example dataset constructed in Sect. 4.1.1. The compared methods must detect new adversarial examples with limited samples. Consequently, they should be end-to-end, data-driven learning methods amenable to fine-tuning during the testing stage.

The Baseline and Baseline++ methods [16] use the backbone network structure from Sect. 3.2 and are trained on a balanced adversarial example dataset randomly sampled from \(\mathcal {D}_{\text {meta\_train}}\) with equal numbers of adversarial and normal examples. Baseline and Baseline++ differ in that Baseline uses a linear layer after the backbone network as the classifier, while Baseline++ uses cosine distance. Both methods are trained for 50 epochs, the parameters of the backbone network are fixed during fine-tuning, and only the parameters of the classifier are trained. An SGD optimizer is used, with a learning rate of 0.01, and fine-tuning is performed with 20 iterations on the support set during the meta-testing stage. For MetaAdvDet, the default settings from [57] are employed for CIFAR-10 and ImageNet. A decaying learning rate setting is applied to MNIST, where the inner layer update learning rate is initialized to 1 and the outer layer update learning rate is set to 0.1 for the 1-shot setting, and to 0.1 and 0.01, respectively, for the 5-shot setting. The learning rate decays every 700 iteration steps to 1/10 of the original. PACA utilizes a two-stream architecture that leverages pixel artifacts and confidence artifacts for detecting adversarial examples subjected to both few-perturbation and large-perturbation attacks [14]. We further improve the PACA method to detect adversarial examples under few-shot conditions, using a pre-trained model provided by the authors with the default settings from [14]. During the testing stage, the entire network is fine-tuned on the support set with an update number of 20.

4.2 Attack Results Against Target Models

The construction of the adversarial example datasets involves only those samples that are correctly classified by the target model and successfully attacked by the adversary. The attack success rates on the validation and test sets are presented in Table 2. The results suggest that the attacks included in the cross-adversary benchmark achieve high attack success rates on all datasets, except for STA and SA. It is observed that AA is less effective on MNIST. Moreover, \(L_2\)-norm attacks, such as LBFGS and CAA-\(L_2\), have low attack success rates on ImageNet.

Table 2 Attack success rate of unknown adversarial attacks on different datasets

4.3 Performance Under the Oblivious Attack

Table 3 presents the results of various methods in the cross-adversary benchmark. Our proposed method achieves optimal detection results across all datasets and settings. On MNIST, all methods achieve high detection performance under the 5-shot setting, while MetaAdvDet lags behind the Baseline method under the 1-shot setting. Baseline and Baseline++ demonstrate poor detection results on the complex CIFAR-10 and ImageNet datasets, with detection rates hovering around 50%, equivalent to a random detector. After transitioning from the 1-shot to the 5-shot setting, MetaAdvDet achieves a substantial improvement in detection performance of over 10%, slightly surpassing PACA on ImageNet but remaining significantly lower than both PACA and PAD on CIFAR-10.

Table 3 Detection results (%) of various methods in the cross-adversary benchmark in [57]

The AUC results for the detection of the five attacks in the cross-adversary benchmark and the two new adversarial attacks AA, CAA are shown in Figs. 3, 4 and 5. The results in Fig. 3 demonstrate that PAD outperforms all other methods on MNIST. Additionally, MetaAdvDet exhibits a significant gap with the other methods in detecting STA and AA.

Fig. 3

AUC (%) score of unknown adversaries from the cross-adversary benchmark and novel adversaries under 1-shot and 5-shot settings on MNIST. a 1-shot; b 5-shot

The detection results on CIFAR-10 are presented in Fig. 4. PAD surpasses other methods in detecting the five attacks in the cross-adversary benchmark, though its performance on AA and CAA-\({L_\infty }\) is weaker than Baseline++, MetaAdvDet, and PACA. This weaker performance on \({L_\infty }\) attacks is likely due to the detector’s awareness of the attacks, suggesting that the three detection methods may be overfitting. Additionally, PAD outperforms MetaAdvDet in detecting CAA-\(L_2\) attacks, with a gap of around 5% compared to PACA.

Fig. 4

AUC (%) score of unknown adversaries from the cross-adversary benchmark and novel adversaries under 1-shot and 5-shot settings on CIFAR-10. a 1-shot; b 5-shot

Figure 5 presents the detection results on ImageNet. PAD demonstrates inferior performance compared to MetaAdvDet in detecting SA; however, it surpasses all other methods in detecting a variety of unknown and new attacks. Although PACA outperforms state-of-the-art detection approaches [14] in detecting EAD with few-perturbation and low confidence levels, PAD and MetaAdvDet are superior in the few-shot detection task presented in this study, with PAD exceeding PACA by over 20%.

Fig. 5

AUC (%) score of unknown adversaries from the cross-adversary benchmark and novel adversaries under 1-shot and 5-shot settings on ImageNet. a 1-shot; b 5-shot

The results from three datasets reveal that the presented method enhances robust generalization, accurately characterizes the adversarial subspace [61], and excels in detecting unknown attacks.

4.4 Performance Under the Adaptive White-Box Attack

A targeted white-box C&W attack is executed on a set of tasks, constructed using the methodology described in Sect. 3.3. These tasks involve random samples from the C&W adversarial example dataset combined with normal test samples. During the attack against each task, normal samples from the query set serve as initial inputs, with target labels differing from both the ground truth and the target model’s predicted label.

Samples that successfully attack the combined model are utilized as adversarial examples for task construction, with the detection model fine-tuned on each task to detect these samples. The detection results are presented in Table 4. The table illustrates that, when provided with a small number of labeled adversarial examples from the adaptive attack, the attacked PAD model improves detection performance across all datasets and achieves the best detection results.

Table 4 Detection results (%) of various methods under the adaptive white-box attack scenario

4.5 Ablation Study

To assess the effectiveness of each PAD module, ablation experiments are conducted on CIFAR-10 and ImageNet using F1 scores (Table 5) under the cross-adversary benchmark. The table illustrates the impact of the first convolutional layer’s structure, CL2N, and MAP usage. One convolution with a kernel size of 7\(\times \)7 in the first layer outperforms three 3\(\times \)3 convolutions, regardless of average pooling. Notably, removing average pooling significantly improves CIFAR-10 detection results, indicating larger kernel sizes and unpooled structures benefit the detector’s generalization. MAP enhances the detector’s performance under the 1-shot setting, with CL2N providing a slight additional improvement.

Table 5 Ablation study results of PAD modules in the cross-adversary benchmark on CIFAR-10 and ImageNet

In Table 6, we present a comparative analysis of the impacts of various class-center calculation methods on the CIFAR-10 and ImageNet datasets. The NCM approach directly leverages support set samples for the computation of class centers. Conversely, the K-means method employs the assignment probability computation method described in Eqs. 5–7. This method initializes the class centers using the support set and subsequently updates them through an iterative process resembling the expectation–maximization (EM) algorithm. On the other hand, the MAP method utilizes Sinkhorn’s algorithm to compute assignment probabilities and updates class centers through Eqs. 12 and 13. An examination of the table reveals that the MAP algorithm consistently yields the best detection performance.

Table 6 Ablation study results with different class-center calculation methods in the cross-adversary benchmark on CIFAR-10 and ImageNet

To investigate the impact of backbone network depth on detection performance, we incrementally increase the feature backbone depth from Conv-3 to Conv-4, ResNet-10, 18, 34, 50, and 101. AUC scores for CIFAR-10 and ImageNet are presented in Fig. 6. The detection performance exhibits a clear trend with increasing network depth, where it plateaus at ResNet-10 and subsequently declines, except in the 5-shot CIFAR-10 setting. Experiments reveal that deeper networks in MetaAdvDet do not improve performance but increase training costs; thus, they are omitted. PAD is suitable for deeper structures, and ResNet-10 outperforms other architectures across various datasets and settings.

Fig. 6

Ablation study results of backbone networks with different depths. a CIFAR-10; b ImageNet

4.6 Time Cost

Regarding the complexity of implementation, we focus on the time costs involved in three phases: the preparation of adversarial examples, model training, and model inference. For the inference phase, our method consists of two primary processes: feature extraction and MAP. The complexity of the feature extraction is equivalent to that of a standard CNN. In contrast, the complexity of the MAP process is closely related to the Sinkhorn algorithm and depends on both the feature dimension and the number of samples. All timing measurements were conducted using a single Nvidia Tesla V-100 GPU. Given that the principal objective of our research is the enhancement of detection performance, we did not engage in comparative analyses regarding the time consumption of our method against other approaches.

  1. Adversarial Examples Generation Time: The generation of adversarial example datasets is generally completed within a few hours. Exceptions include EAD, JSMA, and LBFGS types due to their higher time demands. We employed identical parameters for EAD and JSMA as those in the MetaAdvDet source code for consistency. To reduce computational demands, we adjusted only the attack strength for generating LBFGS adversarial examples on ImageNet. On CIFAR-10, adversarial examples took 20.5 h (EAD), 15.5 h (JSMA), and 57.3 h (LBFGS) to generate. On ImageNet, these times were 89.5 h (EAD), 32.2 h (JSMA), and 22.7 h (LBFGS).

  2. Detector Training Time: In the meta-training stage, each task in the 1-shot setting includes 72 samples from both support and query sets, increasing to 80 samples in the 5-shot setting. The training times per epoch for MNIST, CIFAR-10, and ImageNet adversarial example datasets were 5.41, 7.32, and 74.43 min (1-shot) and 6.08, 7.51, and 81.64 min (5-shot), respectively.

  3. Inference Time: At the meta-testing stage, each task consists of 32 samples (1-shot) and 40 samples (5-shot). Inference times per task for MNIST, CIFAR-10, and ImageNet were 9.22 ms, 9.47 ms, and 10.14 ms (1-shot) and 23.69 ms, 24.92 ms, and 27.63 ms (5-shot), respectively. The MAP algorithm primarily dictates the inference time.

5 Conclusions, Limitations, and Future Work

While much work focuses on detecting known adversarial examples by treating them as noise, this paper considers adversarial examples as features, proposing an end-to-end detection method called PAD to enhance the detector’s generalization capability across various scenarios. PAD employs a convolution with a larger kernel size and omits the pooling module in the first layer, maximizing feature extraction across varying resolutions. The prototypical network is trained on a set of tasks containing known adversarial examples. After applying the CL2N feature transformation, the features of the support set, which come from the unknown adversarial example detection tasks, are used as initial class centers. The optimal-transport inspired MAP algorithm is then used to update the class centers and calculate probabilities, significantly improving few-shot detection performance. Extensive comparative experiments demonstrate PAD’s improved generalization to unknown attacks and robustness against adaptive white-box attacks, given a limited number of labeled adversarial examples.

During the meta-testing stage, our method detects adversarial examples through feature extraction and the MAP process. Feature extraction, akin to an ordinary DNN, requires one forward propagation. The MAP process, which is more time-intensive, computes the optimal match and thus impacts inference efficiency. The MAP method assumes a uniform class prior, which may hinder performance on class-imbalanced tasks, as shown in few-shot learning experiments [103]. Furthermore, our adversarial example detection tasks contain only one adversarial type per task, so the method may struggle when multiple types are mixed in a single task, potentially affecting class center computation and detection effectiveness. The proposed method, which depends exclusively on image information, demonstrates diminished effectiveness against known attack types, such as \(L_{\infty }\) attacks. In contrast, the PACA method, leveraging both image data and confidence scores from the target classifier, exhibits superior performance in detecting known adversarial examples on the CIFAR-10 dataset.

Future research will explore incorporating additional information beyond the image to enhance the detector’s generalizability, without compromising known adversarial example detection. For the purposes of consistency in benchmark testing, adversarial examples tailored to CNN architectures were employed. The rising prominence of transformer architectures in robustness research presents an intriguing direction, especially considering the distinct perturbations against transformers compared to CNNs [79]. By integrating the transformer architecture into our detection backbone, we anticipate improvements in feature extraction capabilities and the provision of adaptive defenses against white-box attacks. In addition, it is important for the detector to be able to detect adversarial examples across different architectures. This requires using adversarial examples generated against one architecture for meta-training and those generated against another architecture for meta-testing.