1 Introduction

Deep learning models have made significant strides in a variety of fields. Yet, their sensitivity to subtle perturbations has been exposed by adversarial examples, which typically remain undetected by human observation (Szegedy et al. 2013; Shaham et al. 2018; Li et al. 2018; Kim et al. 2023b). Adversarial examples, generated by introducing subtle distortions to original inputs, can significantly alter the output of deep learning models. These examples differ minimally from the original images, yet these small differences are amplified in the outputs of deep learning models. Because these perturbations are unnoticeable to humans, they pose security risks in practical applications of deep learning technologies. Consequently, crafting defensive algorithms to counteract adversarial attacks is crucial for the safe deployment of deep-learning-based artificial intelligence systems.

Adversarial samples are generated by solving optimization problems. Since the first appearance of adversarial samples, many attack methods have been proposed, including the fast gradient sign method (FGSM) (Goodfellow et al. 2014), iterative FGSM (Kurakin et al. 2016), DeepFool (Moosavi-Dezfooli et al. 2016), the Carlini and Wagner (C&W) attack (Carlini and Wagner 2017), and projected gradient descent (PGD) (Madry et al. 2017). Owing to their simple formulations, adversarial attacks are feasible in many tasks, including face recognition (Sharif et al. 2016), reinforcement learning (Huang et al. 2017), audio classification (Kim et al. 2023d), object detection (Wang et al. 2020), and medical imaging (Li et al. 2020).

Therefore, many defense mechanisms for handling such adversarial attacks have been proposed. Some mechanisms exploit additional heuristics, such as test-time randomness (Guo et al. 2017; Dhillon et al. 2018), non-differentiable preprocessors (Xie et al. 2017; Samangouei et al. 2018), or detection of attacks (Martin and Elster 2020). Although these additional heuristics can defeat simple optimization-based attacks, recent studies have shown that such defenses can be easily defeated by stronger adversaries (Athalye et al. 2018). More recently, several works have studied the smoothness of deep learning models (Kim et al. 2023c; Lee et al. 2021a; Kim et al. 2023b; Stutz et al. 2021).

Another widely used approach is adversarial training (Goodfellow et al. 2014; Madry et al. 2017; Liu and Chan 2022), where adversarial samples generated intentionally during training are used as training inputs. Adversarial training is easy to implement and has not yet been completely defeated. However, adversarial training requires a specific attack algorithm (e.g., FGSM) to generate adversarial training samples and may generalize poorly to other adversarial samples. Despite the years that have passed since PGD-based adversarial training was proposed (Madry et al. 2017; Croce et al. 2020), it remains the leading defense method, albeit with less than optimal performance.

Recently, many studies have attempted to improve the performance of adversarial training by introducing additional regularizers, such as the \(L_2\) loss between logits for a pair of clean and adversarial examples (Kannan et al. 2018), rectified linear unit (ReLU) stability regularizers (Xiao et al. 2018), and domain adaptation loss (Song et al. 2018).

In this paper, we consider the problem of adversarial attacks from the perspective of domain adaptation. Domain adaptation is an aspect of transfer learning that attempts to train a model using labeled source domain data that performs well on a given set of target data. It assumes that two domains are defined for the same task, but with different distributions. Because domain adaptation handles the problem of two domains with different distributions, it is closely related to adversarial robustness. Even though adversarial noise is typically imperceptible to humans, the distributions of adversarial samples in a high-level representation space differ significantly from those of original images (Fig. 1). To construct a model robust against adversarial attacks, it is important to handle distribution distances in a high-level representation space.

Fig. 1

Illustration of the differences between an adversarial image and an original image. Adversarial perturbations are so small that they are often imperceptible to humans. However, adversarial noise is amplified through the layers of a network, which maximizes the distance between an original sample and an adversarial sample in high-level representations (logits). As a result, the network incorrectly classifies the adversarial image “Panda” as “Boxer”. We reduce the differences between the distributions of high-level representations to construct a model that is robust against adversarial attacks

Domain adaptation attempts to resolve the issue of different distributions between domains by learning transferable representations, under which the source and target domains cannot be distinguished. By learning a representation that reduces the distance between two different domains, domain adaptation can construct a model that can be applied to both domains. There are various approaches to minimizing the distance between two domains, such as maximum mean discrepancy (Long et al. 2017), \(\mathcal {H}\)-divergence (Ganin et al. 2016), KL divergence (Lee et al. 2021b), Wasserstein distance (Yoon et al. 2020), and Jensen–Shannon divergence (Tzeng et al. 2017).

Inspired by the domain adaptation approach, we aimed to construct a model that can reduce the differences between distributions in a high-level representation space for original and adversarial images. As shown in Fig. 1, designing a classifier that can reduce the differences between logit distributions can suppress the influence of adversarial perturbations, leading to a model robust against adversarial attacks.

In this paper, we propose the sliced Wasserstein adversarial training (SWAT) method to design a classifier that provides consistent performance on clean and adversarial samples. We make the output logit distributions of clean and adversarial samples more similar by minimizing the Wasserstein metric (Redko et al. 2017; Frogner et al. 2015), which is a meaningful notion of dissimilarity between probability distributions. Although calculating Wasserstein distance can be computationally expensive, our approach based on sliced Wasserstein distance (SWD) uses a simple numerical solution to handle this problem. Recently, several studies have used SWD in various applications (Wu et al. 2019; Lee et al. 2019; Kolouri et al. 2018; Kim et al. 2023a). We also present new generalization bounds for adversarial samples that illustrate the need to reduce the Wasserstein distances between the logit distributions of clean and adversarial samples during adversarial training. The main contributions of this paper can be summarized as follows.

  1. First, we propose a novel approach to aligning the output probability distributions of clean and adversarial data using the Wasserstein metric. We also present the SWAT method, which is a computationally efficient end-to-end network training method using SWD.

  2. Second, we present the theoretical background motivating the SWAT method by providing generalization upper bounds for adversarial samples.

  3. Third, we present empirical evaluations that demonstrate the robustness and accuracy of our method under various white box attacks.

2 Related work

2.1 Adversarial attack methods

Szegedy et al. (2013) demonstrated that small perturbations in original images can easily fool neural network models. In a follow-up paper (Goodfellow et al. 2014), a novel attack method called FGSM was proposed, which significantly reduced the computational time required to generate adversarial images using simple one-step back-propagation.

$$\begin{aligned} x_{adv} = x + \epsilon \text {sign}{(\nabla _x \mathcal {L}(\theta , x, y))} \end{aligned}$$
(1)

The symbols \(x, y, \theta\), and \(\mathcal {L}\) represent an input image, input label, network weights, and loss function, respectively. Using the above algorithm, one can obtain adversarial images, denoted as \(x_{adv}\), within the \(\epsilon\)-ball surrounding x.
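A minimal PyTorch sketch of Eq. (1) is shown below. The model, labels, and the assumption that pixels lie in \([0,1]\) are placeholders rather than the exact experimental setup.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon):
    """One-step attack x_adv = x + epsilon * sign(grad_x L(theta, x, y)), as in Eq. (1)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    x_adv = x + epsilon * grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()  # assumes inputs are scaled to [0, 1]
```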

One of the strongest types of adversarial attacks is PGD (Madry et al. 2017), which takes gradient steps of size \(\alpha\) and projects the result back onto the set of allowable perturbations \(\mathcal {B}(x,\epsilon )\) in every iteration. This attack often reduces the accuracy of normal models to nearly zero.

$$\begin{aligned} x^{t+1} = \Pi _{\mathcal {B}(x,\epsilon )}(x^t + \alpha \text {sign}{(\nabla _x \mathcal {L}(\theta , x, y))}), \end{aligned}$$
(2)

where \(\Pi _{\mathcal {B}(x,\epsilon )}\) denotes the projection onto the \(\epsilon\)-ball \(\mathcal {B}(x,\epsilon )\).
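The update in Eq. (2) can be sketched as follows, again assuming \([0,1]\)-valued inputs; the random start mirrors the random restarts used elsewhere in the paper.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, epsilon, alpha, steps, random_start=True):
    """Iterative attack with projection onto the epsilon-ball B(x, epsilon), as in Eq. (2)."""
    x_adv = x.clone().detach()
    if random_start:  # random restart inside the ball
        x_adv = x_adv + torch.empty_like(x_adv).uniform_(-epsilon, epsilon)
        x_adv = x_adv.clamp(0.0, 1.0)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - epsilon), x + epsilon)  # projection step
        x_adv = x_adv.clamp(0.0, 1.0)  # keep pixels in a valid range
    return x_adv.detach()
```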

2.2 Adversarial training

Various defense methods have been proposed to preserve the stability of deep learning models under the types of attacks described above. The most widely used defense method is adversarial training, which simply includes adversarial examples when training a model. The two most popular adversarial training methods use FGSM (Goodfellow et al. 2014), and PGD (Madry et al. 2017), respectively. The first method uses FGSM because it can generate adversarial samples quickly (Goodfellow et al. 2014). The second method formulates the empirical adversarial risk minimization problem as the following minimax problem (Madry et al. 2017):

$$\begin{aligned} \min _\theta \mathbb {E}_{(x,y)\sim \mathcal {D}} [\max _{\delta \in \mathcal {B}(x,\epsilon )}\mathcal {L}(\theta ,x+\delta ,y)] \end{aligned}$$
(3)

The inner maximization is approximated by a PGD attack with random restarts. Many previous studies have found this method to be effective, but it cannot defend against all adversarial samples. Above all, because adversarial training relies on a specific attack method, the choice of attack method is very important. We also use clean samples in addition to PGD adversarial training, as recommended in Kurakin et al. (2016). However, since PGD-based adversarial training requires multiple gradient steps, it suffers from a heavy computational burden.
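One training step of the minimax objective (3) can be sketched as follows, reusing the pgd_attack function from the sketch above; the optimizer and loop structure are illustrative assumptions, not the exact training recipe.

```python
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y, epsilon, alpha, steps):
    """One step of Eq. (3): approximate the inner max with PGD, then minimize over theta."""
    x_adv = pgd_attack(model, x, y, epsilon, alpha, steps)  # inner maximization (approximate)
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)                 # outer minimization objective
    loss.backward()
    optimizer.step()
    return loss.item()
```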

Recently, many studies have focused on improving robustness (Madry et al. 2017; Ye et al. 2020; Drewek-Ossowicka et al. 2021; Cao et al. 2019). One such study proposed adversarial training with domain adaptation (ATDA) (Song et al. 2018). The main concept of this method is to use an FGSM adversary as a target domain. The authors exploit three types of loss to align the logit vectors of original images \(\phi (x)\) and adversarial images \(\phi (x_{adv})\): covariance distance, maximum mean discrepancy (MMD) of mean vectors, and a supervised domain adaptation loss consisting of the intra-class variations and inter-class similarities of \(\phi (x)\) and \(\phi (x_{adv})\). Whereas adversarial training attempts to assign clean samples and their corresponding adversarial samples to the same class, Kannan et al. (2018) proposed adversarial logit pairing (ALP), which encourages the logits of the two images to be similar. Other approaches have tried to improve adversarial training with redundant batches and cumulative perturbations (Shafahi et al. 2019), uniform random initialization (Wong et al. 2020), or by avoiding catastrophic overfitting in single-step adversarial training (Kim et al. 2021). There has also been a trend of analyzing the smoothness of adversarially trained models (Lee et al. 2021a; Kim et al. 2023c; Liu and Chan 2022). However, there has not yet been a significant improvement in the performance of defense mechanisms.

To extend these adversarial-training-based approaches, we propose a novel design for distribution-matching adversarial training. The method in Song et al. (2018) requires calculating the covariance distance and MMD of each data pair and optimizing three complicated loss functions. In contrast, our approach is computationally efficient and converges easily. Both the method in Kannan et al. (2018) and our method attempt to minimize the distance between two logits, but our method provides a tighter error bound.

Fig. 2

Illustration of paired \(L_2\) regularizer and our proposed upper bound (\(W_p\)). Black points represent logits of normal samples while the red points represent the logits of adversarial samples. Left: effect of paired \(L_2\) regularizer, Right: effect of our upper bound (color figure online)

3 Wasserstein distance in robust training

3.1 Notations

We consider classification tasks in which \(\mathcal {X}\) is an input space and \(\mathcal {Y}=\{0,\ldots ,c-1\}\) is an output space. Given a hypothesis set \(\mathcal {H}=\{h:\mathcal {X} \rightarrow \mathbb {R}^{\vert \mathcal {Y}\vert }\}\), we define a classification network \(Q_{\theta }\in \mathcal {H}\) with parameters \(\theta\) that outputs logits \(Q_{\theta }(x)\). We let \(\mathcal {D}_S = \langle \mathcal {D}_S^{\mathcal {X}},c^*\rangle\) and \(\mathcal {D}_A = \langle \mathcal {D}_A^{\mathcal {X}},c^*\rangle\) denote the clean source and adversarial domains with the true concept (labeling function) \(c^*: \mathcal {X} \rightarrow \mathcal {Y}\), where clean source samples are drawn as \(x \sim \mathcal {D}_S^{\mathcal {X}}\) and adversarial samples as \(x^{adv} \sim \mathcal {D}_A^{\mathcal {X}}\).

3.2 Wasserstein distance

For any \(p\ge 1\), the p-Wasserstein distance between probability measures \(\mu\) and \(\nu\), where \(\mu ,\nu \in \{ \mu : \int d(x,y)^p d\mu < \infty , \forall y \in \mathcal {Z} \}\), is the p-th root of

$$\begin{aligned} W_p(\mu ,\nu )^p=\inf _{\pi \in \Pi (\mu ,\nu )}\mathbb {E}_{(x,y) \sim \pi }[d(x,y)^p], \end{aligned}$$
(4)

where \(\Pi (\mu ,\nu )\) is the set of all joint distributions whose marginals are \(\mu\) and \(\nu\). According to the Kantorovich duality theorem, the 1-Wasserstein distance can be simplified as

$$\begin{aligned} W_1(\mu ,\nu )=\sup _{f\in \textrm{Lip1}} \ \mathbb {E}_{z \sim \mu }[f(z)] - \mathbb {E}_{z \sim \nu }[f(z)] \end{aligned}$$
(5)

where \(\textrm{Lip1}\) is the set of real-valued 1-Lipschitz continuous functions on \(\mathcal {Z}\), i.e. \(\textrm{Lip1}\equiv \{f: \mathcal {Z}\rightarrow \mathbb {R} : \vert f(x)-f(y)\vert \le d(x,y), \forall x, y \in \mathcal {Z}\}\).

We propose to minimize the Wasserstein distance between the two logit distributions \(Q_{\theta }\#\mathcal {D}^{{\mathcal {X}}}_S\) and \(Q_{\theta }\#\mathcal {D}^{{\mathcal {X}}}_A\) to build a robust model. We use the push-forward notation \(\#\) for transferring the measures \(\mathcal {D}^{{\mathcal {X}}}_S\) and \(\mathcal {D}^{{\mathcal {X}}}_A\) on the input space \(\mathcal {X}\) to the logit space \(\mathcal {Z}\) through the parametrized network \(Q_{\theta }\). Then the Wasserstein distance between the two logit distributions can be written as

$$\begin{aligned}&W_1(Q_\theta \# \mathcal {D}^{{\mathcal {X}}}_S, Q_\theta \#\mathcal {D}^{{\mathcal {X}}}_A) \nonumber \\&\quad = \sup _{f\in \textrm{Lip1}} \mathbb {E}_{z \sim Q_\theta \#\mathcal {D}^{{\mathcal {X}}}_S}[f(z)] - \mathbb {E}_{z \sim Q_\theta \#\mathcal {D}^{{\mathcal {X}}}_A}[f(z)]\nonumber \\&\quad = \sup _{f\in \textrm{Lip1}} \mathbb {E}_{x \sim \mathcal {D}^{{\mathcal {X}}}_S}[f(Q_{\theta }(x))] -\mathbb {E}_{x \sim \mathcal {D}^{{\mathcal {X}}}_A}[f(Q_{\theta }(x))] \end{aligned}$$
(6)

The Wasserstein distance induces a weaker topology than many other distance metrics between probability distributions, such as the Jensen–Shannon divergence and total variation distance. Furthermore, convergence with respect to the topology induced by the Wasserstein distance is equivalent to convergence in distribution. Therefore, it is not only an appropriate metric for the distribution space, but also has better convergence properties, particularly for distributions with low-dimensional supports (Arjovsky and Bottou 2017).
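For one-dimensional empirical measures, the distance in Eqs. (4)–(5) can be computed directly from sorted samples. The snippet below uses SciPy's wasserstein_distance purely as an illustration; it is not part of the proposed method, but it is the one-dimensional primitive that SWD (Sect. 4.1) reuses per projection.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
mu = rng.standard_normal(1000)         # samples from mu
nu = rng.standard_normal(1000) + 0.5   # samples from nu, shifted by 0.5
print(wasserstein_distance(mu, nu))    # approximately 0.5, the size of the shift
```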

3.3 Upper bound on robust training

In this section, we present an upper bound on the objective of robust training. The adversarial risk of a hypothesis \(h\in \mathcal {H}\) in a domain \(\mathcal {D}_S=\langle \mathcal {D}^{{\mathcal {X}}}_S,c^*\rangle\) is defined as follows:

$$\begin{aligned} \mathcal {R}_{robust}(h;\mathcal {D}_S) = \mathbb {E}_{(x,y) \sim \mathcal {D}_S}[\max _{x'\in \mathbb {B}(x)} l(h(x'),y)]. \end{aligned}$$
(7)

We implicitly use y as the label of the input x, i.e., \(y=c^*(x)\). The goal of robust training is to minimize the worst-case misclassification rate on the data domain \(\mathcal {D}_S\), using (7) with the \(0-1\) loss, i.e., \(l(\hat{y},y)=\varvec{1}\{\textrm{arg}\,\textrm{max}_i{\hat{y}_i}\ne y\}\). However, in the training phase, we instead use the cross-entropy loss as a surrogate, since the \(0-1\) loss is intractable (Hoffgen et al. 1995). Recent work on over-parameterized neural networks (Allen-Zhu et al. 2019) has shown that the loss function \(l_y\circ h\equiv l(h(\cdot ),y)\) is Lipschitz-smooth for all y, i.e., \(\vert l_y\circ h(x)-l_y\circ h(x')-\nabla _x l_y\circ h(x)^T (x'-x)\vert \le \frac{1}{2}L\Vert x'-x \Vert _2^2, \forall x', x \in \mathcal {X}\) for some constant L; such an \(l_y\circ h\) is called L-smooth. The following theorem provides a new upper bound based on the combination of clean source data and first-order adversarial data.

Theorem 1

Let \(h \in \mathcal {H}=\{h:\mathcal {X}\rightarrow \mathbb {R}^c\}\) be a hypothesis such that \(l_y\circ h\) is \(\beta\)-smooth. Let \(\mathcal {D}_S\) and \(\mathcal {D}_{A\vert h}\) be the clean source and first-order adversarial domains with respect to the hypothesis h, respectively, and let \({\epsilon }\) be an adversarial perturbation. Then the following inequality holds:

$$\begin{aligned}&\mathcal {R}_{robust}(h; \mathcal {D}_S) \nonumber \\&\quad \le \frac{1}{2} (\mathcal {R}_{S}(h) +\mathcal {R}_{A\vert h}(h) \nonumber \\&\qquad +\sqrt{\frac{c}{c-1}} W_1(h\#\mathcal {D}^{\mathcal {X}}_{A \vert h},h\#\mathcal {D}^{{\mathcal {X}}}_S) +\beta \Vert \epsilon \Vert _2^2), \end{aligned}$$
(8)

where \(\mathcal {R}_{S}(h)\equiv \mathbb {E}_{\mathcal {D}_S}[l(h(x),y)]\) and \(\mathcal {R}_{A\vert h}(h)\equiv \mathbb {E}_{(\tilde{x},y)\sim \mathcal {D}_{A\vert h}}[l(h(\tilde{x}),y)]\).

As a result, the upper bound on the adversarial risk can be decomposed into four parts. The first two terms are the risks on the clean source samples and the first-order adversarial samples, respectively. The third term is the Wasserstein distance between the logit distributions of the source and adversarial domains. As will be discussed later, our proposed method tries to minimize the terms in this upper bound. To compute the Wasserstein distance in the third term, we use SWD for computational efficiency.

3.4 Advantage of Wasserstein distance

In this section, we analyze the use of the Wasserstein distance between normal logits and adversarial logits. To minimize the adversarial risk through our upper bound, it is necessary to reduce the Wasserstein distance between the logit distributions, which is the third term in Theorem 1. From the perspective of matching two logit distributions, previous approaches (Kannan et al. 2018; Pang et al. 2020) applied \(L_2\) distances to paired logits, as in ALP. In contrast, our upper bound reduces the optimal transport cost rather than the paired \(L_2\) distance.

The \(L_2\) regularizer minimizes the (expected) difference between a pair of logits z (normal example) and \(z^*\) (corresponding adversarial example) as follows:

$$\begin{aligned} \mathcal {L}_{ALP}=\mathbb {E}_{{\mathcal {D}}}\left[ d(z,z^*)\right] =\int d(z,z^*) dp(z,z^*). \end{aligned}$$
(9)

The Wasserstein distance regularizer minimizes the optimal transport cost between two distributions of logits, where

$$\begin{aligned} W_1(\mu _z,\mu _{z^*})=\inf _{\pi \in \Pi }\mathbb {E}_{\pi } \left[ d(z,z^*)\right] =\int d(z,z^*) d\tilde{\pi }(z,z^*) \end{aligned}$$
(10)

where \(\mu _z\) and \(\mu _{z^*}\) are the measures of the normal logits and adversarial logits, respectively, and \(\tilde{\pi }\) is an optimal plan for the transport between \(\mu _z\) and \(\mu _{z^*}\). The difference lies in which transport plan is used between \(\mu _z\) and \(\mu _{z^*}\). Therefore, \(W_1(\mu _z,\mu _{z^*})\le \mathcal {L}_{ALP}\) holds, implying that the Wasserstein regularizer yields a tighter bound than the paired \(L_2\) regularizer.

From a more intuitive perspective, the paired \(L_2\) regularizer tries to match \(Q_{\theta }(x^{adv}_{i})\) to the corresponding normal logits \(Q_{\theta }(x_{i})\). On the other hand, since our upper bound seeks the optimal plan between \(\mu _z\) and \(\mu _{z^*}\), it matches \(Q_{\theta }(x^{adv}_{i})\) to the nearest normal logits \(Q_{\theta }(x_{j})\).

A comparison between the paired \(L_2\) regularizer and our upper bound is illustrated in Fig. 2. The left panel visualizes the training procedure of the paired \(L_2\) regularizer, while the right panel shows our Wasserstein-based upper bound. The black points represent the logits of normal samples, while the red points are the logits of adversarial samples. In Fig. 2, we can see that ALP focuses on matching the paired samples \(z_i\) and \(z'_i\), while our upper bound focuses on matching the global distributions \(\mu _{z}\) and \(\mu _{z^{*}}\) by minimizing the optimal transport cost.

Since our proposed method reduces the optimal transport cost between \(\mu _z\) and \(\mu _{z^*}\), it can prevent over-regularization. For example, in Fig. 2, the logits of the adversarial sample \(z'_1\) can be robust if they are embedded near the normal logit distribution \(\mu _z\). In this case, the paired \(L_2\) regularizer reduces the distance between \(z'_1\) and \(z_1\), whereas our proposed method reduces the distance between \(z'_1\) and \(z_2\). If the labels of \(z_1\) and \(z_2\) are identical, reducing \(d(z'_1,z_2)\) makes it easier to learn a robust embedding and prevents over-regularization. We provide further analysis of this behavior on real datasets in Sect. 5.2.
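The inequality \(W_1(\mu _z,\mu _{z^*})\le \mathcal {L}_{ALP}\) can be checked on a toy example: with two normal logits and two adversarial logits whose pairing is "swapped", the optimal transport cost is far smaller than the paired \(L_2\) cost. The snippet below is purely illustrative; for equal-weight discrete measures the optimal plan reduces to an assignment problem.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

z     = np.array([[0.0, 0.0], [1.0, 1.0]])   # logits of normal samples
z_adv = np.array([[1.1, 1.0], [0.1, 0.0]])   # logits of adversarial samples (order swapped)

cost = np.linalg.norm(z[:, None, :] - z_adv[None, :, :], axis=-1)  # pairwise distances
paired_cost = np.mean(np.diag(cost))          # ALP-style paired L2 cost
row, col = linear_sum_assignment(cost)        # optimal transport plan (equal-weight case)
ot_cost = cost[row, col].mean()               # 1-Wasserstein cost
print(paired_cost, ot_cost)                   # W1 <= L_ALP: roughly 1.42 vs 0.10 here
```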

Fig. 3

Illustration of the architecture of our proposed method. Our method is designed to reduce the SWD between the two logits \(Q_{\theta }(x)\) and \(Q_{\theta }(x^{adv})\). Using SWD, we can reduce the Wasserstein distance between two measures through linear projections with uniform measures on a unit sphere, enabling end-to-end training

4 Proposed method

4.1 Sliced Wasserstein distance

To minimize the upper bound on robust training in Theorem 1, we need to compute the optimal transport between adversarial and normal logits, which is computationally expensive. In this paper, we propose using the sliced Wasserstein distance (SWD) to approximate the Wasserstein distance between two different distributions. SWD shares similar properties with the original Wasserstein distance but is easier to compute (Kolouri et al. 2018). It projects higher-dimensional densities onto a set of one-dimensional distributions and compares the projected distributions via the Wasserstein distance. Since SWD induces the same topology as the Wasserstein distance on a compact set (Bonnotte 2013), for example on the logit image space \(h(\mathcal {X})\) for a bounded domain \(\mathcal {X}=[0,1]^n\), we use SWD to empirically compute the Wasserstein distance in Theorem 1.

The sliced Wasserstein distance between \(\mu\) and \(\nu\) can be defined as follows:

$$\begin{aligned} \text {SWD}(\mu ,\nu )\equiv \int _{\Omega } W (\mu ^w,\nu ^w) d\mu _{\Omega }(w), \end{aligned}$$
(11)

where \(\mu _{\Omega }\) is a uniform measure on the unit sphere \(\Omega\) such that \(\int _{\Omega } d\mu _{\Omega }(w) =1\), and the measures \(\mu ^w = w^{T}\mu\) and \(\nu ^w = w^{T}\nu\) are one-dimensional projections of the measures \(\mu\) and \(\nu\) onto the direction \(w\in \Omega\). We then extend the definition to finite sets \(\mathcal {S}\) and \(\mathcal {T}\) as \(\text {SWD}(\mathcal {S},\mathcal {T})\equiv \text {SWD}(\mu _{{\mathcal {S}}},\mu _{{\mathcal {T}}})\), where \(\mu _{{\mathcal {S}}} = \frac{1}{\vert \mathcal {S}\vert }\sum _{s\in \mathcal {S}}\delta _s\) and \(\mu _{{\mathcal {T}}} = \frac{1}{\vert \mathcal {T}\vert }\sum _{t\in \mathcal {T}} \delta _t\) with the Dirac measure \(\delta _x\) centered on a point x.

The integral (11) for finite sets \(\mathcal {S},\mathcal {T} \subset \mathbb {R}^p\) with the same cardinality \(\vert \mathcal {S}\vert =\vert \mathcal {T}\vert =n\) can be approximated as follows:

$$\begin{aligned} \text {SWD}(\mathcal {S},\mathcal {T})&=\text {SWD} (\mu _{{\mathcal {S}}},\mu _{{\mathcal {T}}}) \approx \frac{1}{\vert \hat{\Omega }\vert } \sum _{w\in \hat{\Omega }} W (\mu _{{\mathcal {S}}}^{w}, \mu _{{\mathcal {T}}}^{w})\nonumber \\&=\frac{1}{\vert \hat{\Omega }\vert } \sum _{w\in \hat{\Omega }} \sum _{i=1}^{n} \vert w^T s_{i,w}-w^T t_{i,w}\vert ^2, \end{aligned}$$
(12)

where \(\hat{\Omega }=\{w_j\}\) is a finite set of uniform samples from the \((p-1)\)-dimensional unit sphere \(\Omega\), \(s_i, t_i\) are elements of \(\mathcal {S}\) and \(\mathcal {T}\), respectively, and \(s_{i,w}, t_{i,w}\) are rearrangements of \(s_i, t_i\) such that \(w^T s_{i,w}\le w^T s_{i',w}\) and \(w^T t_{i,w}\le w^T t_{i',w}\) for all \(i\le i'\) and \(w\in \hat{\Omega }\).

Unlike the original Wasserstein distance \(W(\mu _{{\mathcal {S}}}, \mu _{{\mathcal {T}}})\) between high-dimensional datasets \(\mathcal {S}\) and \(\mathcal {T}\), SWD uses the one-dimensional linear projections \(\mu _{{\mathcal {S}}}^w\) and \(\mu _{{\mathcal {T}}}^w\) to measure distance. Because computing the one-dimensional Wasserstein distance only requires sorting and summing the distances between sorted pairs, SWD has a significantly lower computational cost than the original Wasserstein distance, and it enables end-to-end learning with a single deep classifier network. In our experiments, we used \(\vert \hat{\Omega }\vert =10\) projection samples.
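A minimal PyTorch sketch of Eq. (12) is given below; the function name and the Gaussian-then-normalize sampling of directions are implementation assumptions, and \(\vert \hat{\Omega }\vert =10\) follows the setting reported above.

```python
import torch

def sliced_wasserstein_distance(z, z_adv, n_projections=10):
    """Approximate SWD between two equal-size batches of logits, following Eq. (12)."""
    n, c = z.shape
    w = torch.randn(n_projections, c, device=z.device)
    w = w / w.norm(dim=1, keepdim=True)        # uniform random directions on the unit sphere
    proj     = z @ w.t()                       # (n, n_projections) one-dimensional projections
    proj_adv = z_adv @ w.t()
    proj_sorted, _     = proj.sort(dim=0)      # sorting realizes the rearrangement s_{i,w}
    proj_adv_sorted, _ = proj_adv.sort(dim=0)
    return ((proj_sorted - proj_adv_sorted) ** 2).sum(dim=0).mean()
```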

Therefore, since SWD induces the same topology as the Wasserstein distance on a compact set, we can provide a new upper bound using SWD. Using the inequality in Theorem 5.1.5 of Bonnotte (2013), the upper bound on the objective of robust training becomes the following corollary.

Corollary 4.1

Under the same conditions as in Theorem 1, for a constant \(C_c\), the following inequality holds:

$$\begin{aligned}&\mathcal {R}_{robust}(h; \mathcal {D}_S) \nonumber \\&\quad \le \frac{1}{2}\bigg( \mathcal {R}_{S} (h)+\mathcal {R}_{A\vert h}(h) \nonumber \\&\qquad + C_c \sqrt{\frac{c}{c-1}} \text {SWD} (h\#\mathcal {D}^{\mathcal {X}}_{A\vert h},h \#\mathcal {D}^{{\mathcal {X}}}_S)^{1/(c+1)} +\beta \Vert \epsilon \Vert _2^2 \bigg) . \end{aligned}$$
(13)

Recently, concerns have been raised that SWD might not approximate the true Wasserstein distance well as the dimension increases. However, since we match two logit distributions, which are not high-dimensional, SWD approximates the Wasserstein distance successfully. In Sect. 5.4, we provide more details on this approximation.

4.2 Sliced Wasserstein adversarial training (SWAT)

In this section, we describe how the proposed model is trained on real datasets. At the beginning of training, we sample a mini-batch \(B=\{x_i,y_i\}_{i=1}^{m}\) (with inputs \(B^X = \{x_i \}_{i=1}^{m}\)) from the clean dataset, where m is the batch size. Using an adversarial attack, we generate adversarial data \(B_{adv} =\{x^{adv}_i,y_i\}_{i=1}^{m}\) (with inputs \(B^{X}_{adv} = \{x^{adv}_i \}_{i=1}^{m}\)) in each epoch. In this paper, we used the FGSM method to generate adversarial samples.

First, we apply the supervised loss function to both the clean data \(\{x_i,y_i\}\) and the adversarial data \(\{x^{adv}_i,y_i\}\) for the classifier \(Q_{\theta }\), and define the loss functions as

$$\begin{aligned} \mathcal {L}_S&= \frac{1}{m} \sum _{i=1}^{m} l(Q_{\theta }(x_i),y_i)\\ \mathcal {L}_A&= \frac{1}{m} \sum _{i=1}^{m} l(Q_{\theta }(x_i^{adv}),y_i). \end{aligned}$$

Next, we minimize the Wasserstein distance between the probability distributions of the logits \(Q_{\theta }(B^X)\) and \(Q_{\theta }(B^{X}_{adv})\) in order to design a classifier that performs consistently on both adversarial and clean data. We formulate the loss function using SWD as follows:

$$\begin{aligned} \mathcal {L}_{SWD}&= \text {SWD}(\mu _{B^X},\mu _{B^{X}_{adv}})\nonumber \\&=\frac{1}{\vert \hat{\Omega }\vert }\sum _{w\in \hat{\Omega }} \sum _{i=1}^{m} \vert w^T Q_{\theta }(x_{i,w})-w^T Q_{\theta } (x^{adv}_{i,w})\vert ^2. \end{aligned}$$
(14)

During the optimization phase, we combine the adversarial training loss function and SWD loss function as follows:

$$\begin{aligned} \mathcal {L}_{total} = \mathcal {L}_{S} + \mathcal {L}_{A} + \lambda \mathcal {L}_{SWD} \end{aligned}$$
(15)

where \(\lambda\) is a hyperparameter balancing the regularization term. The first and second terms in Eq. (15) are the supervised losses on the clean and adversarial datasets, corresponding to the first two terms of Theorem 1. We optimize the classifier \(Q_{\theta }\) by iteratively minimizing the loss functions \(\mathcal {L}_{S}\) and \(\mathcal {L}_{A}\) on the batches B and \(B_{adv}\). The third term of Eq. (15) is the sliced Wasserstein distance between the logit distributions of the clean and adversarial batches, which relates to the third term of Theorem 1. We summarize our framework in Algorithm 1 and illustrate its overall architecture in Fig. 3.

Algorithm 1

SWAT (Sliced Wasserstein Adversarial Training)
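The following is a minimal PyTorch sketch of one SWAT training step corresponding to Eq. (15) and Algorithm 1. It reuses the fgsm_attack and sliced_wasserstein_distance sketches above; the function name swat_step and the single-batch structure are our own simplifications rather than the exact Algorithm 1.

```python
import torch.nn.functional as F

def swat_step(model, optimizer, x, y, epsilon, lam):
    """One optimization step of L_total = L_S + L_A + lambda * L_SWD (Eq. 15)."""
    x_adv = fgsm_attack(model, x, y, epsilon)            # adversarial mini-batch B_adv
    logits, logits_adv = model(x), model(x_adv)
    loss_s   = F.cross_entropy(logits, y)                # L_S on clean data
    loss_a   = F.cross_entropy(logits_adv, y)            # L_A on adversarial data
    loss_swd = sliced_wasserstein_distance(logits, logits_adv)
    loss = loss_s + loss_a + lam * loss_swd
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```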

5 Experiments

5.1 Dataset and model architecture

In this section, we evaluate our method on three standard classification benchmark datasets. The CIFAR-10 dataset (Krizhevsky and Hinton 2009) consists of 50,000 training images and 10,000 test images. Each image is 32 \(\times\) 32 \(\times\) 3 and the dataset contains 10 classes. SVHN (Netzer et al. 2011), obtained from house numbers in Google Street View images, is a digit classification dataset with an image size of 30 \(\times\) 30 \(\times\) 3. It contains 73,257 training images and 26,032 test images with 10 classes (one class per digit). Fashion-MNIST (Xiao et al. 2017) contains 28 \(\times\) 28 grayscale images with 10 label classes, where each class denotes a fashion item category, such as “t-shirt” or “sneaker.” This dataset contains 60,000 training images and 10,000 test images. The architectures of our deep learning models are summarized in Table 1. For each dataset, we constructed a different architecture to ease comparison with other state-of-the-art methods, as follows:

Fig. 4

Accuracy under white box attacks (FGSM, PGD, and C&W) on three standard datasets (CIFAR-10, SVHN, and Fashion-MNIST) on the test set. The x-axis is the perturbation level (\(\epsilon\)) and the y-axis is accuracy (%) (Best viewed in color) (color figure online)

Fig. 5

Certified accuracy in the \(l_2\) norm using randomized smoothing (Cohen et al. 2019). We follow the same color scale as Fig. 4 (Best viewed in color) (color figure online)

Table 1 Architecture of our deep learning model

5.2 Comparison methods

We compared our method to the following seven baseline methods. (1) Normal: a basic model trained only on clean data with a classification loss. (2) AT (PGD): adversarial training using PGD adversarial samples (Madry et al. 2017). (3) ATDA: ATDA training (Song et al. 2018) with a regularization hyperparameter of \(\frac{1}{3}\). (4) ALP: ALP training (Kannan et al. 2018) with the same logit pairing weight of 0.5 for all data. (5) Free: free single-step adversarial training (Shafahi et al. 2019). (6) Fast: fast adversarial training (Wong et al. 2020). (7) SSAT: single-step adversarial training (Kim et al. 2021). (8) Ours: the proposed method using sliced Wasserstein distance. (9) Ours\(^*\): the proposed method with additional label information.

Fig. 6

Visualization of normal samples and adversarial samples in the CIFAR-10 dataset

To push further, we also consider a modified version of the proposed method, Ours\(^*\). In Fig. 6, we show the projected normal logits \(Q_{\theta }(x_i)^{w}\) at the bottom and the projected adversarial logits \(Q_{\theta }(x^{adv}_i)^{w}\) at the top, where the color of each point represents its label. The black lines link the paired samples in ALP (left) and the optimal transport matches in Ours (right). Compared to ALP, which tries to reduce the distance even if the corresponding sample is far away, our method reduces the distance from each adversarial sample to the nearest normal sample. Moreover, Fig. 6 shows that most of the samples in a single batch are matched with samples of the same label. However, in some settings, SWAT might match samples with different labels during training. Therefore, to remove the possibility that a sample is matched with another sample of a different label, we suggest a variation of our proposed method. Ours\(^*\) also reduces the sliced Wasserstein distance between \(\mu _z\) and \(\mu _{z^*}\), but when finding the optimal transport plan, Ours\(^*\) considers label information. Therefore, it always matches normal and adversarial samples that share the same label, as sketched below.
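A minimal sketch of one way to realize this label-aware variant is to compute the SWD within each class, so that matches never cross label boundaries. The function class_conditional_swd and its per-class averaging below are illustrative assumptions, not the exact implementation of Ours\(^*\); it reuses the sliced_wasserstein_distance sketch above.

```python
import torch

def class_conditional_swd(logits, logits_adv, y, n_projections=10):
    """SWD restricted to same-label pairs: one SWD term per class, then averaged."""
    losses = []
    for c in y.unique():
        mask = (y == c)
        if mask.sum() > 1:                    # skip classes with too few samples in the batch
            losses.append(sliced_wasserstein_distance(
                logits[mask], logits_adv[mask], n_projections))
    return torch.stack(losses).mean()
```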

In our proposed method, we use \(\lambda =1\) for the CIFAR-10 dataset and \(\lambda =0.5\) for the SVHN and Fashion-MNIST datasets. For the CIFAR-10 dataset, we generated adversarial images using FGSM with \(\epsilon = 8/255\) and \(\alpha = \epsilon /4\) in the training phase. For PGD, we used seven iterations with a single random restart. For the SVHN dataset, we used FGSM with \(\epsilon = 0.02\), \(\alpha = \epsilon /10\) and PGD with 20 iterations and a single random restart. Finally, we set \(\epsilon = 0.1\), \(\alpha = \epsilon /10\) and used 40 iteration steps with a single random restart for the Fashion-MNIST dataset.

Table 2 Average distortion metrics over successful adversarial examples generated by the EAD attack. The distortions are measured using three different metrics (\(L_1\), \(L_2\), and \(L_{\infty }\))

5.3 Results

5.3.1 Classification performance under white box attacks

To evaluate the robustness of our method against adversarial attacks, we measured its classification accuracy under various distortion levels. We evaluated classification performance under four white box attacks: FGSM (Goodfellow et al. 2014), PGD (Madry et al. 2017), Carlini and Wagner (C&W) (Carlini and Wagner 2017), and EAD (Chen et al. 2018) attacks.

FGSM: CIFAR-10: distortion levels ranging from 0 to 10/255 with steps of 2/255. SVHN: \(\epsilon\) ranging from 0 to 0.025 with steps of 0.005. Fashion-MNIST: \(\epsilon\) ranging from 0 to 0.25 with steps of 0.05.

PGD: We used the same distortion levels as those used for FGSM for each dataset. CIFAR 10: \(\alpha = \epsilon /4\) with 20 iteration steps. SVHN & Fashion-MNIST: \(\alpha = \epsilon /10\) with 20 iteration steps.

C&W: We used constant c values ranging from \(10^{-3}\) to \(10^{2}\) on a base-10 logarithmic scale for every dataset with 100 optimization steps.

EAD: We used nine binary search steps and ran 1000 iterations with an initial learning rate of 0.01.

The test results are presented in Fig. 4. In this figure, one can see that our method exhibits performance similar to that of the other models (Fig. 4a, d, g) for FGSM attacks. Because every compared method exhibits decent performance under FGSM attacks, which we mainly used as adversarial samples during the training phase, we can assume that all of the models converged during the training phase.

However, in Fig. 4b and e, one can see that our method exhibits the highest robustness against strong PGD attacks. The results of the C&W attacks also demonstrate the robustness of our model against different white box attacks that were not used during the training phase.

Compared to ALP (cyan line in Fig. 4), our method exhibits similar results for FGSM attacks. However, the performance under PGD and C&W attacks indicates that our method is better at aligning logits in the presence of unknown adversarial attacks. Compared to the PGD training model (light blue line in Fig. 4), our model performs better on the three datasets under strong PGD attacks. In Fig. 4h, the AT (PGD) model performs better than the other methods, but one can see that it fails against C&W attacks in Fig. 4i.

Empirical results demonstrate that while other models fail to construct a generalized defense model for adversarial attacks that were not used in the training phase, our model exhibits consistent results for various types of attacks. One can conclude that our method may exhibit robustness against unknown future adversarial attacks.

We also measured robustness against the EAD attack (Chen et al. 2018). Since the attack success rate of EAD was close to 100% on all three datasets, we instead measured the distance between each original image and its EAD adversarial image to evaluate robustness. We call these distortion metrics; the larger the distortion metric, the more robust the model. We measured the distance using three different metrics (\(L_1\), \(L_2\), and \(L_{\infty }\)). The average distortion metrics over successful EAD adversarial examples are summarized in Table 2. Our method achieved the best results in seven of the nine metrics.

Fig. 7

Approximation of Wasserstein distance with SWD and GSW

5.3.2 Certified radius

While high classification accuracy under white box PGD attacks provides strong empirical evidence that the proposed model is robust against many types of adversarial attacks, it cannot guarantee robustness to norm-bounded attacks. Therefore, it is necessary to compute certified accuracy metrics to determine the effect of SWD regularization on the robustness of the classifier.

We evaluated certified accuracy by computing the certified radius proposed in Cohen et al. (2019). Certified accuracy is defined as the fraction of the test set in which no example is misclassified within its r-neighborhood. We used an induced randomized classifier g with Gaussian noise of \(\sigma =2.0\) for Fashion-MNIST and \(\sigma =0.25\) for the other datasets. We used 100 samples for class selection and \(10^5\) samples for certified radius estimation for each test sample. To reduce computation time, we used 1/100 of the samples from the test set to evaluate robustness.
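As a rough illustration of the certification procedure, the sketch below estimates a plug-in certified radius \(\sigma \Phi ^{-1}(\hat{p})\) under Gaussian noise. It omits the separate class-selection step and the confidence-interval correction of the full CERTIFY algorithm of Cohen et al. (2019), so it only approximates the metric reported here; the function name and sample count are illustrative.

```python
import torch
from scipy.stats import norm

def certified_radius(model, x, sigma, n_samples=1000):
    """Plug-in estimate of the l2 certified radius sigma * Phi^{-1}(p_hat) for one input x."""
    noisy = x.unsqueeze(0) + sigma * torch.randn(n_samples, *x.shape)
    with torch.no_grad():
        preds = model(noisy).argmax(dim=1)
    top_class = preds.mode().values.item()
    p_hat = (preds == top_class).float().mean().item()
    p_hat = min(p_hat, 1.0 - 1e-6)            # avoid an infinite radius when p_hat = 1
    if p_hat <= 0.5:
        return top_class, 0.0                 # no certificate (the smoothed classifier abstains)
    return top_class, sigma * norm.ppf(p_hat)
```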

As shown in Fig. 5, SWD regularization (red) improves certified accuracy compared to ALP (cyan) regularization and achieves better results than the other methods.

5.4 Approximation of Wasserstein distance

Recently, concerns have been raised about the sliced Wasserstein distance: in high-dimensional settings, random projections might not capture the properties of the original distributions. In response, Kolouri et al. (2019) proposed generalized sliced Wasserstein distances (GSW), which use an additional optimization to better approximate the Wasserstein distance.

In this paper, we use SWD between two logit distributions whose dimension is 10. Since this is not high-dimensional, SWD is an appropriate approach for efficiently computing the Wasserstein distance. To show that SWD is sufficient for approximating the Wasserstein distance of a 10-dimensional distribution, we calculated the Wasserstein distance using three different methods during training.

In Fig. 7, we present the values of the Wasserstein distance approximated by the linear programming (LP) optimal transport approach, SWD, and GSW. In this figure, the x-axis denotes distribution samples and the y-axis represents the distance. We can see that both SWD and GSW successfully approximate the Wasserstein distance (LP). Considering that computing the distance with SWD and GSW takes 0.385 s and 6.276 s, respectively, SWD was appropriate in our setting.

6 Conclusion

In this paper, we proposed a novel defense framework called SWAT that minimizes the Wasserstein distance between the logits of clean and adversarial data samples. We used SWD to design a computationally efficient end-to-end training framework that is robust to adversarial attacks. Empirical results demonstrated that our model is more robust than previous defense models on three standard datasets in terms of four different adversarial attacks and certified accuracy. Our method significantly outperformed previous methods against adversarial attacks that were not used for adversarial training and achieved the highest certified accuracy. Visualizations of the logit spaces of clean and adversarial samples indicated that SWAT successfully aligns output distributions.