1 Introduction

Whereas neural networks are being developed in a broad spectrum of applications with great success, they are not intrinsically robust. In particular, imposing imperceptible but carefully chosen deviations on the inputs, also known as adversarial perturbations, can drastically change the prediction of the neural network (Biggio et al., 2013; Pei et al., 2019; Goodfellow et al., 2014). Worse still, these crafted inputs can be transferred across different models, enabling black-box attacks in which the target model is even hidden from the attackers (Szegedy et al., 2014; Papernot et al., 2016, 2017). It is thus of great importance to ensure that deployed models are robust and generalize to diverse input perturbations, especially in safety-critical and security-sensitive applications, e.g., autonomous driving and identity authentication systems (Biggio & Roli, 2018; Mirjalili & Ross, 2017).

Since the first observation of the high vulnerability of neural networks to adversarial examples, there has been a flurry of activity on crafting sophisticated adversarial perturbations (Gowal et al., 2020; Athalye et al., 2018; Carlini & Wagner, 2017), which has in turn encouraged a great deal of investigation on building defenses against such perturbations (Goodfellow et al., 2015; Tramèr et al., 2018; Yan et al., 2018; Cissé et al., 2017). Among them, adversarial training is widely regarded as one of the most effective methods to achieve robustness: it trains robust models by generating adversarial examples at each training step and injecting them into the training set (Goodfellow et al., 2015; Madry et al., 2018).

Despite the great success of adversarial training and its variants, the accuracy of deep models on adversarial inputs is still far below that on normal inputs. This gap can be traced back to an increased sample complexity induced by the non-convex nature of the min–max formulation underlying adversarial training and most of its variants (Schmidt et al., 2018). An intuitive remedy is to leverage more training data for greater robustness (Schmidt et al., 2018; Alayrac et al., 2019). However, this is impractical in many real-world applications, as annotation and data-efficiency challenges are further exacerbated in the context of adversarial training.

Self-supervised and unsupervised training techniques attempt to address this challenge by eliminating the need for manually labeled data. Among them, contrastive learning (CL) has advanced self-supervised representation learning and achieved state-of-the-art performance (Chen et al., 2020; Wu et al., 2018; Hénaff, 2020; van den Oord et al., 2018; Hjelm et al., 2019; He et al., 2020). Despite a recent surge in activity, CL-based self-supervised learning has only recently been leveraged for gaining robustness. The prior work (Chen et al., 2020) is the first to incorporate adversarial training with self-supervised learning; nevertheless, it heavily relies on multiple ad-hoc pretext tasks. To address this issue, a new family of methods has been proposed (Jiang et al., 2020; Kim et al., 2020; Fan et al., 2021). The intuition behind these methods is to encourage the pre-trained representation to be robust through contrastive learning. However, many questions remain unanswered, especially with respect to the inadequacy of class-discrimination, which is a characteristic issue of conventional self-supervised learning.

Our work attempts to make a rigorous and comprehensive study of the above issues. Inspired by a successful discrimination-enhancement solution (Khosla et al., 2020), we propose a framework for robust pre-training that advances self-supervised contrastive learning by incorporating supervision for discrimination enhancement, while taking aim at adversarial robustness.

Our method is motivated by two observations. First, contrastive learning approaches alone are already able to acquire expressive representations of data, yet these representations are not adversarially robust. Second, adversarial self-supervised learning is somewhat effective for robust generalization but lacks class-discrimination. Therefore, we adapt the contrastive learning scheme to adversarial examples for robustness enhancement, and also extend the self-supervised contrastive approach to the supervised setting for class-wise discrimination. In the empirical part, we verify the effectiveness of our novel framework on adversarial training benchmarks following the common protocols in Zhang et al. (2019), Chen et al. (2020). Concretely, we make the following contributions:

  1. We propose ASCL, an Adversarial Supervised Contrastive Learning framework, which adapts the contrastive learning scheme to adversarial examples and further extends the self-supervised approach to the supervised setting, leading to both adversarial robustness and class-discrimination.

  2. We propose to simultaneously inject label-independent and label-based attacks into the pipeline, generated by the self-supervised contrastive loss and the supervised cross-entropy loss respectively, to further boost generalization and calibration.

  3. We verify the proposed ASCL through extensive evaluation under attacks of different setups and also in highly challenging scenarios, e.g., transferring across datasets and defending against diverse unforeseen corruptions. The proposed method shows consistent superiority, whether followed by lightweight standard fine-tuning or adversarial fine-tuning, in both quantitative and qualitative evaluations.

2 Related work

Our method is deeply rooted in the recent surge of studies on self-supervised representation learning, contrastive learning and adversarial training. Here we focus on the most relevant work.

2.1 Contrastive learning

Numerous approaches for self-supervised representation learning have been developed in recent years. Most adopt objective functions similar to those used for supervised learning, while training the models on handcrafted pretext tasks where the labels are derived from unlabeled data. Pretext tasks, including rotation prediction (Gidaris et al., 2018), Jigsaw puzzles (Noroozi & Favaro, 2016), Selfie (Trinh et al., 2019) and region filling (Criminisi et al., 2004), heavily rely on heuristics and thus suffer from limited generality of the learned representations.

A recently proposed family of self-supervised methods, contrastive learning (Chen et al., 2020; Tian et al., 2020b; He et al., 2020; Chen et al., 2020), has demonstrated an impressive ability to learn generalizable representations. The general idea of contrastive learning is to acquire invariant representations by maximizing the agreement between positive samples while contrasting them with the negatives, where the positives are different augmented views of the same sample. SimCLR (Chen et al., 2020) has been demonstrated to be a simple yet powerful contrastive learning framework for representation learning, on which we elaborate our formulation. Another work (Khosla et al., 2020) extended the self-supervised contrastive approach to the supervised setting, for full utilization of label information. Although the existing literature on contrastive learning (Hénaff, 2020; Wu et al., 2018; Hjelm et al., 2019; Tian et al., 2020a; Bachman et al., 2019; Misra & van der Maaten, 2020) shows improvement in either generalization or discrimination, most of these works perform standard training and therefore do not tackle adversarial attacks.

2.2 Adversarial training

There has been a flurry of activity on building defenses against adversarial attacks in recent years. Methods range from input denoising (Cissé et al., 2017; Guo et al., 2018; Liao et al., 2018) and adversarial detection (Lee et al., 2018; Ma et al., 2018) to adversarial training (Madry et al., 2018; Wang et al., 2019; Kannan et al., 2018; Qin et al., 2019; Zhang et al., 2019; Wang et al., 2020). Among them, adversarial training is widely regarded as one of the most effective methods to achieve robustness. The classical version of adversarial training (abbreviated as AT for simplicity) was proposed by Madry et al. (2018); it has withstood intensive scrutiny and is so effective that it is the de facto standard for training models robust against adversarial examples (Gowal et al., 2020; Rebuffi et al., 2021). The main idea behind adversarial training is to train robust models by generating adversarial examples at each training step and injecting them into the training set (Goodfellow et al., 2015; Madry et al., 2018).

Another line of work (Kannan et al., 2018; Zhang et al., 2019) has proposed consistency regularization to boost adversarial robustness over Madry et al. (2018) and its variants. One notable work, TRADES (Zhang et al., 2019), measures the Kullback–Leibler divergence between the softmax outputs of natural–adversarial image pairs, providing theoretical guarantees for a better trade-off between standard and robust accuracy. Their success inspires our natural–adversarial contrastive views in robust pre-training.

2.3 Self-supervised adversarial training

Self-supervised learning has only recently been connected to the study of adversarial robustness. Several recent studies have attempted to use contrastive representation learning in the adversarial robustness field (Chen et al., 2020), in the most straightforward way of injecting adversarial samples, yet yield unsatisfactory results. Other concurrent works (Jiang et al., 2020; Fan et al., 2021) improve robustness through less aggressive adversarial contrastive views.

Despite the better robustness, these methods still leave unanswered questions about class-discriminative ability, which cannot be acquired during self-supervised pre-training but is required for robust prediction on downstream tasks. In the literature (Jiang et al., 2020; Kim et al., 2020), the final adversarial robustness on downstream tasks usually relies heavily on advanced techniques in the fine-tuning phase, which makes the advantages of contrastive-based pre-training less significant. For example, Jiang et al. (2020) suggested adversarial full fine-tuning, where the pre-trained model is only used as an initialization and all the weights are updated by adversarial training.

3 Our proposal

In this section, we present a novel framework for robust pre-training, which advances supervised contrastive learning in the adversarial scenario. We incorporate supervision as a complementary objective, which is co-optimized with the self-supervised contrastive loss through adversarial training.

3.1 Preliminaries

3.1.1 Problem statement

For an L-class classification problem with a given dataset \(\mathcal {D} = \big \{ (x_i,y_i) \big \} _{i=1}^n \subseteq \mathcal {X} \times \mathcal {Y}\), the goal is to learn a classifier \(f_\theta : \mathcal {X} \rightarrow \mathcal {Y}\) that maps an image \(x \in \mathcal {X} \subseteq \mathbb {R}^d\) to a one-hot label \(y \in \mathcal {Y} \subseteq \mathbb {R}^L\) according to certain quality metrics, such as the error probability \(\text {err}(f_\theta ):= \mathbb {P}_{(x,y)\sim \mathcal {P}_{x,y} } \big (f_\theta (x) \ne y \big )\) in standard training, where \(\mathcal {P}_{x,y}\) denotes the underlying joint distribution over (x, y) pairs. In the scenario of adversarial training, the aim is to train \(f_\theta\) to correctly classify not only x but also the adversarially perturbed data \(x+\delta\). Here \(\delta\) denotes a perturbation subject to the \(\ell _p\)-norm budget \(\Vert \delta \Vert _p \le \epsilon\).

3.1.2 Adversarial training

The high-level idea of adversarial training (AT) is to directly augment the training set with perturbed samples generated on-the-fly, so that the model becomes robust to such attacks (Goodfellow et al., 2015; Madry et al., 2018; Athalye et al., 2018). Essentially, adversarial training can be formulated as an alternating min–max optimization, whose goal is to minimize the adversarial risk:

$$\begin{aligned} \mathop {\min }\limits _{\theta } \mathbb {E}_{(x,y)\sim \mathcal {P}_{x,y}} \left[ \max \limits _{\delta \in \mathbb {S}} \ell (f_{\theta }(x+\delta ),y)\right] , \end{aligned}$$
(1)

where \(f_\theta (\cdot )\) is the output vector of the \(\theta\)-parameterized learning model, x is the input image and y is the corresponding label-indicator vector. Here \(\mathcal {P}_{x,y}\) denotes the underlying joint distribution over (x, y) pairs, and \(\mathbb {S}\) denotes the set of allowed perturbations. For example, for adversarial perturbations within the \(\epsilon\)-ball bounded by the \(l_p\)-norm, the perturbation set can be written as \(\mathbb {S}_p = \{ \delta \mid \Vert \delta \Vert _p \le \epsilon \}\) for \(\epsilon > 0\). The symbol \(\ell\) is a suitable classification loss (e.g., the 0–1 loss in the context of a classification task).

More concretely, since finding the optimum of the inner maximization problem above is NP-hard, Madry et al. (2018) proposed to approximately solve the inner maximization by the projected gradient descent (PGD) method. They compute the perturbation in K gradient ascent steps of size \(\alpha\) as:

$$\begin{aligned} \begin{aligned} \hat{\delta }&= \delta ^{(k)} + \alpha \cdot \text {sign}\big ( \nabla _x \ell \big ( f(x+\delta ^{(k)}),y \big ) \big ) \\ \delta ^{(k+1)}&= \max \big ( \min (\hat{\delta },\epsilon ),-\epsilon \big ) \end{aligned} \end{aligned}$$
(2)

where \(\delta ^{(0)}\) is chosen at random within \(\mathbb {S}\), and \(\nabla _x\ell\) denotes the gradient of the loss \(\ell\) with respect to the input image x. We will refer to this inner optimization procedure with K steps as \(\text {PGD-}{} \textit{K}\).
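For concreteness, the following PyTorch-style sketch implements the PGD-K procedure above for an \(l_\infty\) budget; the function names, default hyper-parameters and the pixel-range clipping are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn.functional as F

def pgd_perturb(model, x, y, loss_fn=F.cross_entropy, eps=8/255, alpha=2/255, steps=10):
    """Approximate the inner maximization of (1) with PGD-K under an l_inf budget."""
    # delta^(0) is chosen at random inside the allowed perturbation set S.
    delta = torch.empty_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        loss = loss_fn(model(x + delta), y)
        grad = torch.autograd.grad(loss, delta)[0]
        # Gradient ascent step, then projection back onto the eps-ball, as in Eq. (2).
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)
    # Return the perturbed input, kept inside the valid pixel range.
    return (x + delta.detach()).clamp(0.0, 1.0)
```

The same routine can be reused with any differentiable surrogate loss by swapping out `loss_fn`, which is how the contrastive-loss attacks discussed later can be generated.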

For each sample x with ground-truth label y, one of the most basic forms of AT (Madry et al., 2018) replaces the non-differentiable 0–1 loss with the softmax cross-entropy loss \(\ell _{\text {xent}}\) and minimizes the loss given in (1) through the following implementation:

$$\begin{aligned} \mathcal {L}_{\text {AT}} := \max \limits _{\Vert \delta \Vert _p \le \epsilon } \ell _{\text {xent}}\big ( f_\theta (x+\delta ),y\big ). \end{aligned}$$
(3)

3.1.3 Contrastive-style adversarial training

Contrastive learning is an important class of self-supervised learning algorithms and a powerful approach to learning effective representations for better performance or faster training on downstream tasks, without requiring labeled data (Wu et al., 2018; Chen et al., 2020; Hénaff, 2020; Hjelm et al., 2019; Tian et al., 2020a). The general idea is to learn effective representations by maximizing the agreement between different augmentations of the same sample, via a contrastive loss:

$$\begin{aligned} \ell _{\text {CL}}(\widetilde{x}_i,\widetilde{x}_j) = - \sum \limits _{i \in I} \log \frac{\text {exp} (z_i \cdot z_{j} /\tau ) }{ \sum \limits _{a \in A(i)} \text {exp} (z_i \cdot z_a / \tau )}, \end{aligned}$$
(4)

where \(\widetilde{x}_i\) and \(\widetilde{x}_j\) are augmented views of sample x, and \((\widetilde{x}_i,\widetilde{x}_j)\) denotes a contrastive pair. The notation \(i \in I \equiv \{ 1 \cdots 2N \}\) is the index of an arbitrary augmented view, j is the index of a different augmented view of the same sample, and \(A(i) \equiv I \backslash \{i\}\) is the set of indices with i excluded. The indices i and j are called the anchor and the positive, respectively, and the other \(2(N-1)\) indices are called the negatives. The symbol \(\cdot\) denotes the inner product, \(\tau > 0\) denotes a temperature parameter, \(z_k\) denotes the projected representation under the k-th augmented view, and \(\text {exp}(\cdot )\) denotes the exponential function.
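As a reference point, a minimal PyTorch-style implementation of the loss in (4) might look as follows. It assumes the 2N projected representations are stacked so that rows k and k+N come from the two views of the same sample, and it averages over anchors; these batching conventions are our assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z, tau=0.5):
    """Self-supervised contrastive loss of Eq. (4).

    z: (2N, d) projected representations; rows k and k+N are the two views of sample k.
    """
    z = F.normalize(z, dim=1)
    n = z.shape[0] // 2
    sim = z @ z.t() / tau                                   # z_i . z_a / tau for all pairs
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float('-inf'))              # A(i) excludes the anchor itself
    # The positive of view i is its other augmented view (index shifted by N).
    pos_idx = torch.arange(2 * n, device=z.device).roll(n)
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    return -log_prob[torch.arange(2 * n, device=z.device), pos_idx].mean()
```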

When applying contrastive learning to the field of adversarial defense, namely robust pre-training, the optimization objective (1) becomes:

$$\begin{aligned} \min \limits _{\theta } \mathbb {E}_{x \in \mathcal {X}} \max \limits _{\Vert \delta \Vert _{p} \le \epsilon } \ell _{\text {CL}} (\widetilde{x}_i + \delta _i, \widetilde{x}_j+\delta _j), \end{aligned}$$
(5)

where the adversarial perturbations \(\delta _i\), \(\delta _j\) w.r.t. the contrastive views \(\widetilde{x}_i\), \(\widetilde{x}_j\) can be computed by the PGD-K optimization procedure (2) with the loss \(\ell\) designed as the contrastive loss \(\ell _{\text {CL}}\).

Generally, in contrastive-style adversarial training, a supervised fine-tuning stage immediately follows the self-supervised pre-training (5). Specifically, the supervised fine-tuning can be formulated as:

$$\begin{aligned} \min _{\theta _{\mathrm {c}}} \mathbb {E}_{(x, y) \in \mathcal {D}} \ell _{\mathrm {sup}}\left( \phi _{\theta _{c}} \circ f_{\theta }(x), y\right) , \end{aligned}$$
(6)

where \(\ell _{\text{ sup }}\) denotes a supervised loss (e.g., the cross-entropy loss), and \(\phi _{\theta _{c}} \circ f_{\theta }\) denotes the classifier consisting of a linear prediction head \(\phi _{\theta _{c}}\) (to be learned in the fine-tuning) on top of a feature encoder \(f_{\theta }\), which is learned in the pre-training phase. Note that one can also apply the worst-case cross-entropy loss for the supervised fine-tuning, that is, using \(\phi _{\theta _{c}} \circ f_{\theta }(x+\delta )\) in (6) instead.
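A minimal sketch of the fine-tuning setup in (6) is given below, assuming a pre-trained PyTorch encoder that outputs flattened feat_dim-dimensional features; the zero initialization of the head mirrors the setup used later in our experiments, and all names are illustrative.

```python
import torch.nn as nn

def build_finetune_model(encoder, feat_dim, num_classes, freeze_encoder=True):
    """Classifier phi_{theta_c} o f_theta of Eq. (6): a linear head on a pre-trained encoder."""
    if freeze_encoder:
        for p in encoder.parameters():
            p.requires_grad_(False)   # only the linear head (theta_c) is learned
    head = nn.Linear(feat_dim, num_classes)
    nn.init.zeros_(head.weight)       # zero-initialized prediction head
    nn.init.zeros_(head.bias)
    return nn.Sequential(encoder, head)
```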

A new challenge arises: in the aforementioned self-supervised contrastive approaches, class-discriminative ability specific to the target categories cannot be acquired, yet it is required for robust prediction on downstream tasks. This can be attributed to the task mismatch between the self-supervised pre-training and the fully-supervised fine-tuning.

3.2 Proposed method: ASCL

3.2.1 Equipping contrastive pre-training with class-wise discrimination

As we target class-wise discrimination, we adopt a supervised recipe in adversarial contrastive learning, as opposed to self-supervised contrastive learning. While this is a simple extension of the self-supervised setup, it is non-obvious how to properly set up the training objective. On the one hand, the sample x is commonly restricted to unlabeled data in previous robust pre-training methods, as (4) is incapable of handling labeled data. On the other hand, a supervised fine-tuning stage usually immediately follows the robust pre-training, using a supervised loss, e.g., the cross-entropy loss or an adversarial classification loss, and training over a target dataset that contains labels. In what follows, we elaborate a solution.

Specifically, we borrow ideas from the recently proposed supervised contrastive learning (SCL) technique (Khosla et al., 2020), which achieves substantial gains in discrimination over conventional contrastive learning by extending self-supervised contrastive learning to the supervised setting. The supervised contrastive loss with respect to the augmented views of the sample x is given by the following formulation:

$$\begin{aligned} \ell _{\text {SCL}} (\widetilde{x}_i , \widetilde{x}_p) = \sum \limits _{i \in I} \frac{-1}{\vert P(i) \vert } \sum \limits _{p \in P(i)} \log \frac{\text {exp} (z_i \cdot z_p / \tau )}{ \sum _{a \in A(i)} \text {exp} (z_i \cdot z_a / \tau )}. \end{aligned}$$
(7)

Similar to the notation A(i) introduced in (4), P(i) denotes the set of indices of all positives distinct from i, i.e., \(P(i) \equiv \{ p \in A(i): y_p=y_i \}\), and the symbol \(\vert \cdot \vert\) denotes cardinality. In practice, \(\widetilde{x}_i\) and \(\widetilde{x}_p\) here are common augmentations of the original sample x, such as random cropping, random color distortion and their composition.
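For illustration, a minimal sketch of the supervised contrastive loss (7) is given below; it assumes the labels have been duplicated across the two views of each sample, and the batching details are our assumptions rather than the implementation of Khosla et al. (2020).

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(z, labels, tau=0.5):
    """Supervised contrastive loss of Eq. (7).

    z:      (2N, d) projected representations of the augmented views.
    labels: (2N,) class labels of the source images, so P(i) = {p != i : y_p = y_i}.
    """
    z = F.normalize(z, dim=1)
    sim = z @ z.t() / tau
    self_mask = torch.eye(z.shape[0], dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float('-inf'))           # A(i) excludes the anchor
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask   # P(i)
    # Average the log-probabilities over P(i), then over anchors i.
    loss_per_anchor = -log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1) / pos_mask.sum(dim=1).clamp(min=1)
    return loss_per_anchor.mean()
```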

Note that SCL was originally devised for classification on natural images, where the source sample x as well as the corresponding augmented views \(\widetilde{x}_i\) and \(\widetilde{x}_p\) are benign. Nevertheless, our goal is to develop robustness enhancement solutions. To this end, the most direct way is to replace the common augmentations with their adversarial versions and then use them for contrastive learning, in a similar way to self-supervised contrastive pre-training. Thereby, an SCL loss that adapts to adversarial samples can be given by an SCL loss over an adversarial contrastive pair \((\widetilde{x}_i+\delta _i, \widetilde{x}_p+\delta _p)\), which has the following form:

$$\begin{aligned} \sum \limits _{i \in I} \frac{-1}{\vert P(i) \vert } \sum \limits _{p \in P(i)} \log \frac{\text {exp} (z^\text {adv}_i \cdot z^\text {adv}_p / \tau )}{ \sum _{a \in A(i)} \text {exp} (z^\text {adv}_i \cdot z_a / \tau )}, \end{aligned}$$
(8)

where \(z^\text {adv}_i\), \(z^\text {adv}_p\) denote the projected representations of the adversarial samples \(\widetilde{x}_i+\delta _i\), \(\widetilde{x}_p+\delta _p\), respectively. More specifically, a PGD adversary can iteratively adjust the perturbation by gradient ascent in the same manner as (2) to maximize the modified SCL loss (8).

Intuitive as it might look, there is a pitfall in this implementation. By minimizing (8), the representation we learn is invariant to different cascades of augmentation and adversarial perturbation, i.e., \(\widetilde{x} + \delta\), while what we hope to defend against are adversarial samples \(x + \delta\). As observed in Xie and Yuille (2020), in the context of deep learning, the representation statistics of a natural sample x and its adversarial version \(x+\delta\) can be very different. This observation also holds true for \(\widetilde{x} + \delta\). Thus, the invariance we obtain by training on \(\widetilde{x} + \delta\) cannot guarantee a decision boundary smooth enough for correct prediction on \(x+\delta\).

Several recent studies (Mao et al., 2019; Kim et al., 2020) have attempted to leverage metric learning in adversarial training, specifically, bringing the natural and adversarial samples of the same class (Mao et al., 2019) or the same instance (Kim et al., 2020) closer while enlarging the margins between distinct classes or instances. Inspired by this, we propose to enforce the supervised contrastive loss over a pair consisting of a common augmentation and the adversarial attack, which is essentially aligned with the intuition of classical adversarial training (Zhang et al., 2019). This is expected to mitigate the aggressive consistency between different worst-case perturbations and thus prevent degradation of the representation quality. Formally, the proposed adversarial supervised contrastive loss can be written as:

$$\begin{aligned} \ell _{\text {ASCL}} (\widetilde{x}_i , x + \delta ) = \sum \limits _{i \in I} \frac{-1}{\vert P(i) \vert } \sum \limits _{p \in P(i)} \log \frac{\text {exp} (z_i \cdot z^{\text {adv}} / \tau )}{ \sum \nolimits _{a \in A(i)} \text {exp} (z_i \cdot z_a / \tau ) }, \end{aligned}$$
(9)

where \(z^{\text {adv}}\) denotes the projected representation under adversarial sample \(x + \delta\).

Using adversarial contrastive views and the proposed supervision scheme in conjunction, the adversarial perturbation \(\delta _{\text {ASCL}}\) can be generated according to procedure (2) by maximizing \(\ell _{\text {ASCL}}\) over the proposed contrastive pair \((\widetilde{x}_i, x + \delta )\).

Equipped with such supervision, we expect to train a robust model and empower class-awareness in the contrastive learning framework.
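To make the loss in (9) concrete, the sketch below gives one possible batched implementation. It reads \(z^{\text {adv}}\) as the adversarial representation associated with the source image of each positive p, and uses the clean views in A(i) as the contrast set; both of these batching choices are our interpretation for illustration, not the paper's reference implementation.

```python
import torch
import torch.nn.functional as F

def ascl_loss(z_clean, z_adv, labels, tau=0.5):
    """Sketch of the adversarial supervised contrastive loss of Eq. (9).

    z_clean: (2N, d) projections of the clean augmented views x~_i.
    z_adv:   (2N, d) projections of the adversarial sample x + delta associated with the
             source image of each view (duplicated across the two views of that image).
    labels:  (2N,) class labels of the corresponding source images.
    """
    z_clean = F.normalize(z_clean, dim=1)
    z_adv = F.normalize(z_adv, dim=1)
    n_views = z_clean.shape[0]
    self_mask = torch.eye(n_views, dtype=torch.bool, device=z_clean.device)
    # Denominator of Eq. (9): similarities between the anchor and the other clean views a in A(i).
    denom = torch.logsumexp(
        (z_clean @ z_clean.t() / tau).masked_fill(self_mask, float('-inf')), dim=1)
    # Numerator: similarity between the clean anchor z_i and the adversarial representation
    # associated with each same-class positive p (our reading of z^adv in Eq. (9)).
    num = z_clean @ z_adv.t() / tau
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    log_prob = num - denom.unsqueeze(1)
    loss_per_anchor = -log_prob.masked_fill(~pos_mask, 0.0).sum(1) / pos_mask.sum(1).clamp(min=1)
    return loss_per_anchor.mean()
```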

3.2.2 Ensemble of adversarial samples

A recent study empirically shows that different pre-training tasks tend to induce models with different adversarial vulnerabilities (Chen et al., 2020). The authors thereby proposed leveraging multiple pretext tasks, including Selfie (Trinh et al., 2019), jigsaw (Noroozi & Favaro, 2016; Carlucci et al., 2019) and rotation (Gidaris et al., 2018), to achieve complementary traits. Although better defense performance can be achieved by an ensemble of self-supervised learning tasks, these handcrafted pretext tasks require heuristic design. Moreover, model ensembling is extremely computationally expensive.

To tackle this problem, we propose to use an ensemble of adversarial samples, rather than an ensemble of multiple pre-training tasks, which would correspond to training several deep models simultaneously. The rationale behind our proposal is that adversarial samples can serve as a kind of data augmentation and lead to gains in robustness (Tack et al., 2021; Xie et al., 2020).

To this end, we use different attack generation strategies to produce an ensemble of adversarial samples. In particular, we leverage the proposed adversarial supervised contrastive loss \(\ell _{\text {ASCL}}\) and the self-supervised contrastive loss \(\ell _{\text {CL}}\) for label-wise and label-independent attack generation, respectively. As will be verified later, such different adversarial samples offer complementary benefits to model performance, i.e., calibration and generalization, as they are statistically distinct.

3.2.3 Overall training objective

Now we derive the overall training objective of our method, i.e., the proposed adversarial supervised contrastive loss in addition to the self-supervised robust training, with hyper-parameters \(\lambda _1\) and \(\lambda _2\) controlling the strength of the self-supervised and supervised regularization, respectively. Formally, the total loss is given by:

$$\begin{aligned} \min \limits _{\theta } \left[ \mathbb {E}_{x \in \hat{\mathcal {X}}}\, \mathcal {L}_{\text {AT}} + \lambda _1\mathbb {E}_{x \in \mathcal {X}} \max \limits _{\Vert \delta _{\text {CL}} \Vert _{\infty } \le \epsilon } \ell _{\text {CL}} + \lambda _2\mathbb {E}_{x \in \hat{\mathcal {X}}} \max \limits _{\Vert \delta _{\text {ASCL}} \Vert _{\infty } \le \epsilon } \ell _{\text {ASCL}} \right] \end{aligned}$$
(10)

where \(\hat{\mathcal {X}}\) and \({\mathcal {X}}\) are labeled and unlabeled datasets, respectively.

Note that our method is agnostic to the implementation of adversarial training. Therefore, we can use either the basic version or any other variant of the adversarial training objective for \(\mathcal {L}_{\text {AT}}\) (e.g., AT (Madry et al., 2018), TRADES (Zhang et al., 2019) or MART (Wang et al., 2020)), and then combine it with the proposed loss terms.
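Putting the pieces together, a minimal sketch of the total loss (10) for one labeled mini-batch could look as follows, where the adversarial inputs behind each term are assumed to have been generated beforehand with PGD on the cross-entropy, \(\ell _{\text {CL}}\) and \(\ell _{\text {ASCL}}\) losses respectively (e.g., with the PGD sketch in Sect. 3.1.2); the default weights follow the values used in our experiments, and all names are illustrative.

```python
import torch.nn.functional as F

def total_ascl_loss(logits_adv, y, loss_cl, loss_ascl, lam1=0.5, lam2=0.2):
    """Overall training objective of Eq. (10).

    logits_adv: classifier outputs on the label-based adversarial examples (for the AT term).
    loss_cl:    self-supervised contrastive loss evaluated on its own adversarial views.
    loss_ascl:  proposed supervised contrastive loss on (clean view, adversarial sample) pairs.
    """
    loss_at = F.cross_entropy(logits_adv, y)          # L_AT of Eq. (3)
    return loss_at + lam1 * loss_cl + lam2 * loss_ascl
```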

4 Experiments and results

In this part, we conduct comprehensive evaluations to verify the effectiveness of the proposed ASCL, including benchmarking its robustness, ablation studies and analysis, so as to provide additional insights.

4.1 Experiment setups

To be comparable to the baselines, we follow the common protocols and experimental setups suggested in previous literature (Chen et al., 2020; Tack et al., 2021; Jiang et al., 2020; Kim et al., 2020; Fan et al., 2021).

4.1.1 Training details

We empirically benchmark our models on CIFAR-10 (Krizhevsky et al., 2009), which is widely used in adversarial training. We adopt ResNet-18 (He et al., 2016) as the backbone in all experiments, as was also used by Zhang et al. (2019). We apply the stochastic gradient descent (SGD) optimizer with a momentum of 0.9 and a weight decay of \(2 \times 10^{-4}\) during training. We employ a piece-wise learning rate schedule, which is initially set to 0.1 and decayed by a factor of 10 at 60% and 80% of the training progress, and a constant learning rate of 0.1 to train the linear layer for 10 epochs unless otherwise specified. The batch size is set to 512 in all experiments. As for the attacks during training, we generate the adversarial samples by standard projected gradient descent (PGD) optimization with a budget \(\epsilon = 8/255\) and step size \(\alpha = 2/255\) under the \(l_{\infty }\) constraint.
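The optimizer and learning-rate schedule stated above can be reproduced with a few lines of PyTorch; the sketch below simply mirrors the stated hyper-parameters and is not the paper's actual training script.

```python
import torch

def make_optimizer_and_scheduler(model, total_epochs):
    """SGD with momentum 0.9 and weight decay 2e-4; lr 0.1 decayed by 10x at 60%/80% of training."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=2e-4)
    milestones = [int(0.6 * total_epochs), int(0.8 * total_epochs)]
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=milestones, gamma=0.1)
    return optimizer, scheduler
```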

By default, we first use ResNet-18 (He et al., 2016) (with the last fully connected layer excluded) as the encoder architecture in the adversarial pre-training; the pre-trained encoder is then frozen and a zero-initialized fully connected layer is added for fine-tuning. We apply the widely used cross-entropy loss for standard fine-tuning, and the TRADES loss (Zhang et al., 2019) for adversarial fine-tuning. The hyper-parameters \(\lambda _1\) and \(\lambda _2\) are set to 0.5 and 0.2, respectively, for a good balance, based on our experimental observations.

4.1.2 Evaluation setups

We evaluate the models by attacking them with untargeted adversarial samples in the white-box setting, which are more challenging to defend against than targeted ones.

The evaluation metrics we employ are: (1) Robust Accuracy (RA), i.e., the classification accuracy on the perturbed testing set; and (2) Standard Accuracy (SA), i.e., the accuracy on natural images without adversarial perturbations. Both RA and SA are reported, so that robustness gains are not achieved by sacrificing nominal performance.

Unless otherwise specified, we evaluate the models by attacking them with both PGD attacks (Fan et al., 2021; Madry et al., 2018) and the more challenging Auto-Attacks (Croce & Hein, 2020b). In particular, we apply 20-step PGD optimization with \(l_\infty\) constraint to generate the adversarial perturbations for all the competing methods to ensure a fair comparison. Accordingly, we use RA-PGD and RA-AA to denote RA of models under PGD attacks and Auto-Attacks respectively, for simplicity.

The baselines we compare to can be roughly divided into supervised adversarial training and self-supervised pretraining with finetuning, including: AT (Madry et al., 2018), a solid baseline of supervised adversarial training; SCL (Khosla et al., 2020), a contrastive learning method that uses labeled data; SimCLR (Chen et al., 2020), a vanilla self-supervised CL method; ACL (Jiang et al., 2020), AdvCL (Fan et al., 2021), RoCL and its enhanced version RoCL + rLE (Kim et al., 2020), which are robust CL-based pretraining methods followed by finetuning, where rLE denotes the additional use of robust Linear Evaluation; and, in contrast, Selfie and Selfie + DPE (Chen et al., 2020), which are self-supervised adversarial training methods that do not use a contrastive loss.

4.2 Comparing ASCL with the state-of-the-arts

To close in on the true robustness, we evaluate the trained models under a wide range of attacks. Tables 1, 2 and Fig. 1 present the comparison results of model robustness against white-box, black-box and unforeseen attacks, respectively, all showing that our ASCL offers consistently better performance than the others. In what follows, we analyze these results and provide additional insights.

Table 1 White-box robustness of ASCL compared with supervised and self-supervised baselines, in terms of RA-PGD, RA-AA and SA on the benchmark CIFAR-10 dataset

4.2.1 White-box robustness

4.2.1.1 Setup

To extensively evaluate the robustness without gradient obfuscation (Athalye et al., 2018), we first consider a wide range of adversarial attacks in the white-box setting, i.e., PGD attacks with 20 iterations (Madry et al., 2018) and the Auto-Attacks (Croce & Hein, 2020b), which further consist of untargeted/targeted FAB (Croce & Hein, 2020a) and the square attack (Andriushchenko et al., 2020) with 5000 queries, making it a well-recognized strong attack to defend against.

4.2.1.2 Results and analysis

Table 1 reports the performance of the proposed ASCL and the competing methods. The most direct baseline that any new adversarial training method should be compared with is plain adversarial training, AT (Madry et al., 2018). We can observe that ASCL yields a significant improvement over AT in robustness, leading by a clear margin of 9.04% and 5.63% on RA-PGD and RA-AA, respectively. This is an impressive result which provides strong evidence that self-supervised representation learning is effective for robustness enhancement. Moreover, ASCL achieves performance comparable to another solid baseline, TRADES (Zhang et al., 2019), which is a supervised method with strong regularization. Note that both AT and TRADES require many more epochs to reach a plateau as they train from scratch. This makes our method more appealing in practice.

We then compare against SimCLR (Chen et al., 2020) and SCL (Khosla et al., 2020), where the former is the powerful contrastive learning framework that our method builds on, and the latter is an extension of the traditional contrastive loss to the supervised setting. The results show that SCL and SimCLR can obtain high accuracy on natural samples; nevertheless, both are extremely vulnerable to adversaries. This is not surprising, as SCL and SimCLR are vanilla CL-based learning methods that are not trained with adversarial samples. We also notice that both RoCL and Selfie lead to considerable improvements in robustness. Nevertheless, RoCL shows a frustrating gap between RA-PGD and RA-AA, and Selfie heavily relies on heuristics for pretext tasks (Trinh et al., 2019). Although Selfie + DPE improves over its base version and yields performance comparable to our ASCL, it requires prohibitively expensive ensemble training. The methods that stand out among the rest are ACL (Jiang et al., 2020) and AdvCL (Fan et al., 2021), which indeed bring improvement over the prior art but with considerable overhead, for thorough finetuning and pseudo labeling respectively. In contrast, ASCL relies on a lightweight finetuning without requiring any plug-in technique or extra offline training. As a whole, the proposed ASCL offers substantial improvement over almost all baseline methods under white-box attacks.

4.2.2 Black-box robustness

4.2.2.1 Setup

Previous works (Carlini & Wagner, 2017; Uesato et al., 2018) show that defenses relying on obfuscated gradients may give a false sense of robustness. In this part, we verify our models in a black-box setting, where the target models are hidden, as are the gradients for generating perturbations. To this end, we first train ResNet-18 with the baseline methods and our ASCL as target models for evaluation. Then, we use models trained with different methods as the source models, and compute the perturbations according to gradients provided by the source models on the inputs. Specifically, we craft adversarial samples on the AT, ACL and AdvCL models to test our ASCL-trained model, and the rest is done in the same manner.

4.2.2.2 Results and analysis

As shown in Table 2, our ASCL consistently achieves the best robustness under black-box attacks, as denoted in bold. Moreover, the proposed ASCL can also be used to generate stronger black-box attacks than the baseline methods, as implied by the lowest values in each row. For instance, the target model AT yields the worst accuracy when it defends against attacks generated with the gradients of ASCL, with a significant drop of 7.64% and 2.50% in robust accuracy compared to attacks sourced from ACL and AdvCL, respectively.

Table 2 Black-box robustness of ASCL compared with baseline methods on the benchmark CIFAR-10 dataset
Fig. 1
figure 1

Robustness against unforeseen attacks on CIFAR-10-C. The proposed ASCL outperforms the baseline on most types, demonstrating more general robustness against various perturbations

4.2.3 Robustness against unforeseen attacks

4.2.3.1 Setup

In addition to evaluating under adversarial attacks, including white-box and black-box attacks, we also verify whether our robustness enhancement consistently holds against unforeseen attacks. As suggested by previous works (Kang et al., 2020; Hendrycks & Dietterich, 2019), we employ the 19 common corruptions in CIFAR-10-C as unforeseen attacks for testing, e.g., impulse noise, JPEG compression and elastic transform. Figure 1 presents the performance comparison between our proposed ASCL and the baseline method of Hendrycks et al. (2019), which is not specially trained for adversarial defense but rather for more general robustness against various corruptions.

4.2.3.2 Results and analysis

As we can see, our method achieves consistent robustness gains in defending against most of the unforeseen perturbations (18 out of 19 types), where the gain ranges from 0.46 to 12.99%, with only a slight drop of 1.05% on the contrast corruption. Remarkably, our method brings an overall gain of 5.55% when averaging over the 19 unforeseen attacks, indicating that the improvement in robustness is spread across perturbations. We note that, as inputs might be attacked in various ways that have not been encountered in training, such robustness should be an indispensable property for models deployed in the real world.

4.3 Ablation study and analysis

4.3.1 The transfer of learned representation across datasets

4.3.1.1 Setup

In what follows, we verify our method on transfer learning in the scenario of adversarial defense, in order to gain insight into the transferability of robust representations from adversarial pre-training to fine-tuning. We closely follow the protocols of previous literature (Kim et al., 2020; Fan et al., 2021). Specifically, models are first pre-trained on dataset A with the same setup as described in Sect. 4.1. We freeze the trained network except for the logits layer, which we replace with a zero-initialized fully connected layer, and then fine-tune it on another dataset B. We perform such an evaluation on two transfer learning tasks, namely CIFAR-10 \(\rightarrow\) STL-10 and CIFAR-10 \(\rightarrow\) CIFAR-100. The former transfers the learned representation from a larger dataset to a smaller one, while the latter is the opposite.

4.3.1.2 Results and analysis

Table 3 shows how our models compare to one-shot adversarial training as well as baselines that, like ours, separate adversarial training into pre-training and fine-tuning. We observe that ASCL leads by a comfortable margin over one-shot training, while RoCL (Kim et al., 2020) barely yields comparable results. This implies that our improvement over AT (Madry et al., 2018) essentially benefits from the richer representations learned by our robust pre-training, rather than from the two-step learning itself. The advantage of integrating the supervised contrastive loss into adversarial training is confirmed by the substantial improvement over the prior art ACL (Jiang et al., 2020) in [RA, SA] of [6.05%, 9.94%] on CIFAR-10 \(\rightarrow\) STL-10, and [4.97%, 3.34%] on CIFAR-10 \(\rightarrow\) CIFAR-100, respectively. More intriguingly, our proposal improves RA as well as SA, which is non-trivial as previous studies have reported a common trade-off between robust and clean accuracy in adversarial training (Tsipras et al., 2019; Zhang et al., 2019).

Table 3 Evaluation results on transfer learning in the scenario of adversarial defense

4.3.2 Visual interpretability of the learned representation

4.3.2.1 Setup

To further demonstrate the effectiveness of our proposal, we visualize the feature inversion maps (Mahendran & Vedaldi, 2016) of neuron responses. It has been shown in Boopathy et al. (2020), Engstrom et al. (2019), Kaur et al. (2019) that representations of robust models are more aligned with human-recognizable features than those of standard networks. Therefore, the feature inversion map can be used as a good indicator of robustness. Concretely, we acquire such maps by maximizing the activation of a specific component of the representation vector with respect to the input images. Formally, we solve the optimization problem:

$$\begin{aligned} x^{\prime } = x_0 + \arg \max \limits _{\delta } g(x_0 + \delta )_i, \end{aligned}$$
(11)

where \(x_0\) denotes a random seed input (an image or noise), \(g(\cdot )\) denotes the learned representation mapping, and i denotes the ith component of the representation vector.
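A simple way to solve (11) in practice is gradient ascent on the chosen component with respect to the input; the sketch below assumes a PyTorch encoder returning the representation vector for a batch of seed inputs, and the optimizer, step count and learning rate are illustrative assumptions.

```python
import torch

def feature_inversion(encoder, x0, component, steps=200, lr=0.1):
    """Gradient ascent on one representation component w.r.t. the input, as in Eq. (11)."""
    delta = torch.zeros_like(x0, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # encoder is assumed to map a (B, C, H, W) batch to (B, d) representations.
        activation = encoder((x0 + delta).clamp(0.0, 1.0))[:, component]
        (-activation.sum()).backward()          # ascend on the chosen component
        opt.step()
    return (x0 + delta.detach()).clamp(0.0, 1.0)
```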

4.3.2.2 Results and analysis

Fig. 2 shows that for the model trained with our method, different neurons consistently represent similar patterns across highly distinct input images. That is in stark contrast to the case of standard models, where the representations tend to be perceptually meaningless and change drastically with the inputs, implying poor model explanation and classifier smoothness. Remarkably, we can still observe consistency of feature representations even when we change the seed input from natural images to random noise, not just across different natural images. This justifies that ASCL equips the model with more perceptually-aligned representations and thus better robustness.

Fig. 2
figure 2

The comparison of feature inversion maps of ResNet-18 neurons when the proposed ASCL and standard training are applied. In the left column we present the seed inputs randomly selected from CIFAR-10, and in the subsequent columns we visualize the activation of a few components of the representation vector with respect to the seed inputs. The striking contrast between the visual interpretability of our model (top) and that of the standard model (bottom) demonstrates the significant robustness enhancement achieved by our proposal

4.3.3 Class-wise discrimination

Furthermore, we demonstrate the advantages of the proposed method from the perspective of class-discriminative ability, which can be illustrated by visualizing the latent representations learned by the models. Figure 3 shows the t-SNE visualization (Arora et al., 2018; Rauber et al., 2016) of sampled CIFAR-10 images, with different colors representing different ground-truth labels. We visualize the penultimate fully connected layer of models trained with RoCL (Kim et al., 2020), ACL (Jiang et al., 2020) and the proposed ASCL, respectively.

Fig. 3
figure 3

t-SNE visualization of latent representations learned with different methods. The proposed ASCL shows a much clearer separation among different classes

As we can see in Fig. 3, RoCL (Kim et al., 2020) and ACL (Jiang et al., 2020) show indistinct boundaries, with representations of different classes drawing closer together. Note that a good discriminative model will push apart clusters of samples from different classes, so that it becomes difficult to misclassify images into wrong categories. In striking contrast to (a) and (b), we can observe clearly separated clusters in (c), where samples belonging to the same class are pulled together while points from different classes are pushed apart, indicating that ASCL is able to discriminate images among different classes better than the baselines.

4.3.4 Linear separability of the learned representations

4.3.4.1 Setup

To further evaluate the learned representations, we carry out linear evaluation. We follow the evaluation protocol in Chen et al. (2020). Specifically, we attach a linear prediction head on top of the pre-trained encoder, with a stop-gradient on the input to the linear head, so as to prevent the labels from affecting the pre-trained encoder during evaluation. We separately perform standard training and adversarial training on the linear classification head, and then evaluate the representations learned by ASCL and the baseline methods in terms of both SA and RA, which are used as indicators of representation quality. Higher SA and RA imply better standard linear separability and adversarial linear separability, respectively.

4.3.4.2 Results and analysis

As shown in Table 4, our method yields the best results with clear performance margins over the others under different evaluation settings, as denoted in bold with respect to each column. Furthermore, we analyze the results by row, to acquire insights from the pre-trained representations. It is notable that the gap induced by Selfie (Trinh et al., 2019) between the RA of standard training and that of adversarial training is extremely large, which indicates that the eventual robustness of the model relies heavily on adversarial fine-tuning rather than on the self-supervised pretraining. Though a significant improvement in RA (from 6.3 to 37.65%) is obtained when we switch the tuning to the adversarial mode, its robustness still lags behind the other methods. On the contrary, we observe the tightest gap (0.71%) achieved by ASCL, demonstrating that the representations learned by our proposed supervised CL-based pre-training are superior to the baseline methods, all of which are purely self-supervised. This makes our ASCL more appealing, as it is much less dependent on retraining and thus enables a merely lightweight finetuning on downstream tasks.

Table 4 Comparison of linear separability of ASCL with baseline methods

4.3.5 Robust generalization analysis through loss surface

There exists a large body of work investigating the correlation between robust generalization and salient properties of the loss surface of DNNs (Keskar et al., 2017; Li et al., 2018; Wu et al., 2020). It has been empirically verified and is commonly accepted that a flatter loss surface tends to yield better generalization, and this understanding has been further utilized to design regularization (e.g., Garipov et al. 2018; Wei et al. 2020; Ishida et al. 2020). Here, we leverage this understanding to analyze robust generalization. In particular, we visualize two types of loss landscape to illustrate the loss surface, i.e., the input loss landscape and the weight loss landscape. The former indicates the change of loss in the vicinity of the input data, and the latter depicts the geometry of the loss landscape around the model weights instead of around randomly sampled inputs (Wu et al., 2020).
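As an illustration of how the input loss landscape can be probed, the sketch below evaluates the classification loss along a fixed input-space direction (for instance, the sign of the loss gradient or a random direction) at increasing magnitudes; the exact protocol behind Figs. 4 and 5 is not fully specified here, so this is only a plausible approximation.

```python
import torch
import torch.nn.functional as F

def input_loss_landscape(model, x, y, direction, radii):
    """Evaluate the loss as the input is shifted along `direction` by each radius in `radii`."""
    direction = direction / direction.abs().max()          # scale to unit l_inf magnitude
    losses = []
    with torch.no_grad():
        for r in radii:
            losses.append(F.cross_entropy(model((x + r * direction).clamp(0, 1)), y).item())
    return losses
```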

Figure 4 shows that the resultant model of AT (Madry et al., 2018) exhibits a convolved surface, implying the occurrence of robust overfitting caused by the supervised objective. While TRADES (Zhang et al., 2019) significantly improves over AT in robust generalization, it is not as strong as our method when defending against perturbations of larger radius. Among these subplots, Fig. 4c presents the smoothest and flattest input loss landscape, as well as the lowest level of loss.

Fig. 4
figure 4

Visualization of loss landscape on CIFAR-10 test set for various models

We further present a comparison of robust generalization through the lens of the weight loss landscape. A similar observation can be made in Fig. 5, where the proposed ASCL produces a flatter surface w.r.t. weight perturbation than the other competing methods, i.e., AT and TRADES. These results confirm our conjecture that the proposed ASCL, which is rooted in contrastive learning, shows great advantages in learning generalized representations over the supervised baselines.

Fig. 5
figure 5

Visualization of loss landscape w.r.t. model weights using different methods on the CIFAR-10 test set

4.3.6 Towards larger steps and larger radius

To further examine whether our trained models generalize well, we perform evaluation under various PGD attacks, with the number of PGD steps varied from 0 to 100 and the perturbation budget varied from 0 to 32/255, respectively. Figure 6 shows that the proposed ASCL consistently outperforms the baselines by a non-trivial margin, ensuring security under a wide range of attacks and thus good generalization.

Fig. 6
figure 6

Robust accuracy of models evaluated under attacks of different strength, in terms of varying numbers of PGD steps (left) and perturbation budgets (right)

4.3.7 Ablation on components

In this part, we compare against ablations of our full model to explore whether each component is essential. The quantitative results reported in Table 5 demonstrate that each component of the proposed training objective is indispensable for the final performance, as the robust accuracy under both PGD attacks and Auto-Attack improves step by step with the integration of each component. It is interesting to find that the standard accuracy is first reduced when the self-supervised contrastive objective is added, yet it goes up as the proposed supervision is incorporated. This suggests that our method greatly benefits from the class-discrimination empowered by the supervised contrastive learning. This is consistent with Table 1, showing that our method sacrifices less standard accuracy than the baseline for robustness.

Table 5 Ablation study on each component of the proposed training objective

To delve into the proposed ASCL, we go beyond adversarial robustness and investigate how the ablations of our full model affect the robustness against a broad range of corruptions (e.g., Gaussian blur, impulse noise). As shown in Fig. 7, simply applying the self-supervised contrastive loss on top of the plain AT objective can substantially improve the performance in defending against all types of perturbations. Moreover, adding the supervised contrastive loss boosts the accuracy still higher. This observation demonstrates that the class-discrimination brought by the supervision indeed enhances the robustness against not only adversarial samples but also common perturbations, making our full objective more appealing in real-world applications than its ablations.

Fig. 7
figure 7

Component analysis of the proposed method. Reported performance on each unforeseen corruption is measured on ResNet-18 trained with standard AT, standard AT with self-supervised contrastive loss, standard AT with both self-supervised and supervised contrastive loss, respectively

4.3.8 Ablation on attack generation in pre-training

We now investigate the performance difference when different attack generation strategies are applied. The label-based strategy trains the model only with attacks generated in the conventional way, i.e., by solving the maximization of the classification loss with the PGD optimizer, while the label-free counterpart crafts the perturbation by maximizing the self-supervised contrastive loss without using any label.

As shown in Table 6, whereas the label-free strategy improves robust accuracy over the conventional label-based generation by 2.84% and 4.06%, it degrades standard accuracy. That is not surprising, as contrastive learning considers no label information and provides no advantage in discrimination for standard classification. It is interesting that such label-free attacks boost the robustness at \(\epsilon =16/255\) by a larger margin than at \(\epsilon =8/255\), thanks to the more generalized representation obtained by contrastive learning. Remarkably, when we inject both the class-wise attacks and the label-free attacks into our pretraining, we are able to achieve the best performance in terms of SA and RA simultaneously. We consider this consistent with another recent study connecting adversarial perturbations with data augmentation for performance gains (Tack et al., 2021), and we leave a deeper investigation to future work.

Table 6 Comparison of pre-trained model with different attack generation schemes for ASCL

4.3.9 Ablation of defense approaches in fine-tuning strategy

Next, we analyze how the finetuning strategy affects the eventual robustness. Table 7 reports SA and RA under attacks with perturbation budgets of 8/255 and 16/255, respectively. We begin by verifying the effectiveness of adversarial finetuning. Comparisons between linear standard finetuning (LSF) and linear adversarial finetuning (LAF), and between full standard finetuning (FSF) and full adversarial finetuning (FAF), show that the use of adversarial finetuning is advantageous, as it yields consistent improvement in RA and/or SA in both the lightweight linear setting and the full-model finetuning setting.

Table 7 Ablation results on finetuning strategies

In the next step, it is natural to compare the impact of linear finetuning with that of full finetuning. As shown in Table 7, the use of full finetuning harms the robustness, leading to a significant drop of 6.07% in robust accuracy (FAF vs. LAF). The robustness even drops to nearly zero when full standard finetuning is applied. We conjecture that one serious problem with full fine-tuning is that most of the learned embedding is wiped away, which further presents scalability challenges on downstream tasks. Interestingly, our experimental results show a somewhat different conclusion from the previous work (Chen et al., 2020; Fan et al., 2021), which concludes that full adversarial finetuning is able to gain a substantial robustness boost while only standard finetuning in full mode weakens the robust representation learned in pretraining. On the whole, in our observation LAF offers better robustness (than standard finetuning) yet consumes much less computation (than full-model finetuning), and it is thus adopted as our default finetuning scheme.

4.3.10 Batch normalization strategy

We further study the performance difference when different batch normalization (BN) strategies are applied, i.e., (1) using the same BN for both branches of the contrastive learning; and (2) using separate BNs for the different branches. Table 8 shows the performance comparison of models with the two BN options. Remarkably, the performance gap between the two settings induced by the proposed method is much smaller than that of the baseline. We conjecture that, since these BN strategies are applied in the contrastive pre-training while the performance is evaluated after fine-tuning is completed, a smaller gap indicates that the proposed pre-training inherits strong discriminative capability from supervised contrastive learning. This also echoes the observation that impressive performance can be achieved by the proposed robust pre-training with just a lightweight fine-tuning scheme.

Table 8 Performance comparison with different batch normalization strategies for ACL (Jiang et al., 2020) and ASCL

5 Conclusion

In this work, we propose ASCL, an extension of contrastive learning that targets adversarial robustness. We show that injecting a supervision component into contrastive-style robust pre-training is beneficial for class-discrimination. We further show that the incorporation of more diverse adversarial attacks offers complementary benefits. Comprehensive evaluations demonstrate that ASCL brings consistent gains over state-of-the-art methods in diverse scenarios, e.g., white-box attacks, black-box attacks, and natural corruptions. Moreover, ASCL shows impressive results in robust transfer learning. Hence, it may help build more robust and secure learning systems in practice.

The adversarial samples we consider in this paper are based on the extensively studied threat model, i.e., \(l_\infty\), and our defense is thus evaluated against it. Meanwhile, systems deployed in the real world are confronted with adversarial attacks from all sides, and we are still far from complete robustness. Nevertheless, we hope that this work inspires others to consider extending advanced contrastive learning methods to adversarial defense scenarios. Potential future work includes scaling our proposed method to larger datasets and models, and a broader fusion with other self-supervised learning techniques.