1 Introduction

Aware of the great potential of deep learning (DL) systems, big companies such as Google, Microsoft, and Facebook invest huge engineering effort to build and maintain DL-based tools and to advance the DL field. DL techniques are also useful for supporting software engineering tasks, e.g. image recognition for software security [1, 2] and image classification for bug detection in mobile apps [3].

However, one of the most important hurdles in developing a DL model (i.e. a deep neural network (DNN)) with competitive performance is the large amount of labeled data required for training. Although massive data can come at little to no cost, labeling all of them is time-consuming and often impractical, especially when expert knowledge is indispensable. To overcome this problem, researchers have proposed approaches that maximize learning capabilities in situations where unlabeled data are abundant but labeling capacity is limited.

Active learning is such an approach and has been widely applied in various domains, such as image classification [4], information extraction [5], software defect prediction [6], and acoustical signal classification [7]. Active learning aims to train a DL model using a small number of carefully selected labeled data, given a limited budget. Such a model can achieve a performance similar to that of a model trained on all (labeled) data. The process is usually iterative: at each iteration (stage), a so-called acquisition function determines the set of unlabeled data to use and an oracle (for instance, a human annotator) is queried to provide the labels of these data. Past studies have shown that active learning can reach the same performance as a model trained on the fully labeled dataset while using only 20% of these data (for the street view house numbers (SVHN) dataset) [8].

An important limitation of active learning so far is that it focuses only on clean performance (most often, test accuracy [4, 8, 9]) and ignores other important quality indicators. In particular, active learning does not produce models that are robust to adversarial examples, which raises major security concerns [10, 11]. These examples are produced by introducing a small perturbation into benign inputs in a way that changes the decision of the model although it should not [12]. Adversarial robustness is challenging to achieve: successive defense methods improve models’ robustness only by small percentages [13].

In this paper, we bootstrap an endeavor towards improving the adversarial robustness of DNNs trained through active learning. We propose adversarial-robust active learning (ARAL), a training process that merges active learning with adversarial training [14], the most effective defense against adversarial examples. The key idea of ARAL is to iteratively select the data to label (just like in “standard active learning”) and, then, to train the model not on the original data but on their adversarial counterparts. An essential question is whether existing acquisition functions remain effective in ARAL. We, therefore, conduct the first comprehensive study that evaluates whether these acquisition functions can produce models that are both accurate and robust. The study covers four datasets, six DNN architectures, eleven acquisition functions, and dozens of labeling budgets. Our investigations reveal that while state-of-the-art (SOTA) acquisition functions remain effective in terms of clean performance, when it comes to robustness, all of them are outperformed by random sampling.

Following our findings, we explore what key factors explain the lack of robustness of models trained using existing acquisition functions. We empirically demonstrate that, while acquisition functions are inherently biased towards selecting data with specific characteristics (e.g. data with the highest entropy), this bias strongly negatively correlates with robustness. This implies that, in ARAL, effective acquisition functions should select a subset of data that is representative of the whole dataset with respect to the aforementioned characteristics. To this end, we propose a new acquisition function—named density-based robust sampling with entropy (DRE)—that selects data while minimizing the difference between the entropy distributions of the selected set and the full dataset. Our results demonstrate that, compared with all existing acquisition functions, DRE is the sole one that achieves higher robustness than random sampling while remaining competitive in terms of accuracy.

While ARAL aims at training robust models from scratch, an adjacent problem is DL testing, i.e. the problem of finding (adversarial) examples that a trained (high-accuracy) DL model misclassifies. Such examples can then be used to retrain the model, thereby improving its adversarial robustness. A crucial part of DL testing is the selection of data from which to generate the adversarial examples. The same acquisition functions that are used in active learning can also serve this purpose. We, therefore, investigate the effectiveness of the acquisition functions in driving DL testing and retraining. Our evaluation demonstrates that DRE achieves the highest gains in robustness.

To sum up, the main contributions of this paper are:

  1. We are the first to formulate and investigate ARAL to solve the problem of training accurate and robust DL models given a limited budget of labeled data.

  2. We conduct the first empirical study on the effectiveness of existing acquisition functions in terms of accuracy and robustness. Our results demonstrate that, though some acquisition functions yield higher accuracy than random sampling (by up to 8.08%), none of them outperforms random sampling in terms of robustness (random sampling is better by up to 22.50%).

  3. We demonstrate that the inherent bias of acquisition functions towards the “informative data” is the cause of their lower robustness (compared to random sampling). Accordingly, we propose DRE, a new acquisition function that yields better robustness (by up to 24.40%) than existing functions while achieving competitive accuracy. DRE, therefore, forms a new baseline for future research on ARAL.

  4. We investigate the use of DRE in the adjacent problem of test selection for model retraining. Experimental results show that DRE achieves competitive accuracy and outperforms the other acquisition functions by up to 8.21% in robustness.

The rest of this paper is organized as follows. Section 2 introduces the background and related work. Section 3 describes ARAL. Section 4 presents the experimental design. Section 5 shows the preliminary results of our empirical study. Section 6 describes DRE and its evaluation. The last section concludes this paper.

2 Background and related work

As this paper focuses on adversarial robustness in active learning, this section introduces the related work on active learning, adversarial robustness, and test selection. Readers can refer to [15,16,17] for comprehensive surveys of these topics.

2.1 Active learning

Active learning is a set of techniques that aims at building high-performing models using only a small set of labeled data. The main hypothesis is that if a model learns from the most informative data, it can obtain a similar performance using substantially fewer data than using the entire training set.

2.1.1 Problem scenario and active learning in deep learning

Typically, in active learning, there are three types of problem scenarios [15, 17]: membership query synthesis, selective sampling, and pool-based sampling. In membership query synthesis, the model generates data to query for labels instead of choosing data from the available unlabeled set [18]. Selective sampling is also referred to as stream-based or sequential active learning [19]. In this scenario, the model receives unlabeled data one at a time and decides whether or not to request the label based on informativeness. In contrast, in pool-based sampling, the model requests labels for a collection of unlabeled data at once in each query [20]. Since in deep learning data are divided into batches to train DNNs and a single data point would not impart a significant change in the model [12], pool-based sampling is the most widely applied active learning scenario.

2.1.2 Empirical study

As many approaches (acquisition functions) exist for active learning, several empirical studies have compared their effectiveness. Arguing that most previous empirical studies were limited to a single model and performance metric, Ramirez-Loaiza et al. [21] employ two models and five measures in their study. However, all the used metrics (i.e. precision, recall, F1, accuracy, and area under the curve (AUC)) evaluate the correctness of the model on clean data only. [22] studies more datasets, classification models, and selection approaches, but the comparison is still based on correctness only. On the other hand, some studies focus on specific applications, such as Japanese word segmentation [23], text classification [24], learning English verb senses [25], sequence labeling [26], and locating temporal activation in video data [27].

In this paper, we overcome the two main limitations of existing studies. First, we go beyond prediction correctness and consider adversarial robustness as an important success metric of active learning. Second, we compare 11 competitive active learning approaches under a common configuration.

2.1.3 Active learning for SE

Many tasks in software engineering (SE) benefit from active learning. For instance, Bowring et al. [28] apply active learning to the automatic classification of program behavior. The model is incrementally trained using selected executions that represent behaviors unknown to the model. On software effort estimation data, active learning has been shown to largely prune the training data and quickly find the essential content [29]. Yu et al. [30] show that active learning can help with conducting literature reviews. Recently, Cambronero et al. [31] proposed the use of active learning to automatically infer programs, which can significantly alleviate the burden of manual annotations. Based on the Mozilla Firefox vulnerability data, Yu et al. [32] show that active learning is useful in building a prediction model that learns from historical source code data and predicts the candidates to inspect. Similarly, Yang et al. [33] study active learning for static code analysis. Active learning also facilitates building defect prediction models [6, 34, 35]. For instance, Tu et al. [35] developed an active learning tool, EMBLEM, to label the most problematic commits, and they claim the first use of active learning in commit defect prediction.

2.2 Adversarial robustness

The adversarial robustness of a DNN relates to its ability to correctly classify adversarial examples. In other words, similar to the accuracy on clean data, the robustness of a DNN is measured by its accuracy on adversarial examples crafted from the clean data.

2.2.1 Adversarial example

Considering the image classification task, given an input image, the corresponding adversarial example is crafted by adding a carefully calculated perturbation into this image to mislead DNNs [12]. Since this perturbation is hardly perceptible for human beings, an adversarial example is regarded as following the same data distribution and having the same label as the original input. The cause of adversarial examples is still under exploration. Some speculative explanations are the extreme nonlinearity of DNNs [12], the high-dimensional linearity of DNNs [36], the presence of non-robust features [37], insufficient regularisation [38], and insufficient model averaging [36].

2.2.2 Adversarial attack

The approaches for crafting adversarial examples are called adversarial attacks (also known as adversaries or threat models). In general, adversarial attacks are divided into three types: black-box, gray-box, and white-box. Black-box attacks have no access to the model; the perturbation is calculated by using the predicted probabilities or logits (score-based), e.g. the square attack [39], by relying only on the predicted class label (decision-based) [40], or by transferring from another DNN model trained with the same training data (transfer-based) [41]. Gray-box attacks only have knowledge of the DNN architecture. In contrast, white-box attacks have full access to the data, parameters, and DNN architecture. The gradient of the training loss is mostly utilized in various white-box methods, e.g. projected gradient descent (PGD) [14]. Besides, the Auto attack [42] combines black-box and white-box attacks and is commonly used as a strong baseline in robustness evaluation. In this paper, we comprehensively consider the square attack (black-box), the PGD attack (white-box), and the Auto attack (adaptive).

2.2.3 Adversarial defense

Mitigating the threat of adversarial examples and improving the robustness of DNNs is of great importance and has attracted considerable attention [15]. Adversarial defense refers to a defense mechanism that secures DNNs against adversarial attacks. Many defense approaches have been proposed, such as adversarial training [14], input denoising [43], input transformation [44], randomization [45], and defensive distillation [46, 47]. Among them, adversarial training with a PGD adversary [14] has proven to be one of the most effective methods. The difference from standard training, where the original training data are fed into the DNN to tune all the parameters, is that in each epoch of adversarial training a new set of data is generated by applying the PGD attack to the original training data and is then used to train the model.
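
To make the procedure concrete, the following is a minimal PyTorch sketch of one epoch of PGD-based adversarial training. The `pgd_perturb` helper, the eps/alpha/steps values, and the `model`, `loader`, and `optimizer` objects are illustrative assumptions, not the exact implementation of [14] or of our experiments.

```python
import torch
import torch.nn.functional as F

def pgd_perturb(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Craft PGD adversarial examples within an L-infinity ball of radius eps (illustrative values)."""
    x_adv = torch.clamp(x + torch.empty_like(x).uniform_(-eps, eps), 0, 1)  # random start
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()            # ascend the loss
            x_adv = torch.clamp(x_adv, x - eps, x + eps)   # project back into the eps-ball
            x_adv = torch.clamp(x_adv, 0, 1)               # stay in the valid pixel range
    return x_adv.detach()

def adversarial_training_epoch(model, loader, optimizer):
    """One epoch of adversarial training: the model is updated on PGD examples, not on clean data."""
    model.train()
    for x, y in loader:
        x_adv = pgd_perturb(model, x, y)   # new adversarial counterparts crafted in every epoch
        optimizer.zero_grad()
        F.cross_entropy(model(x_adv), y).backward()
        optimizer.step()
```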

2.3 Test selection for retraining

Software engineering researchers have devoted substantial effort to evaluating and improving the quality of DL models. DL testing techniques aim to expose flaws in DL models by selecting test data (from an existing data pool) or by generating new test data (i.e. adversarial examples). DL engineers can then use those data to retrain the model and improve it. Multiple test selection metrics have been proposed to solve various DL testing problems, including test accuracy estimation [48, 49], test data prioritization [50,51,52], DNN comparison [53], and DNN retraining [3, 51, 54].

In particular, past research on DNN retraining [51, 54] has shown that, given a trained DNN and a set of unlabeled test data, test selection metrics can select data to retrain the DNN (either on the selected data or on adversarial examples produced from the data) and improve its quality (clean performance or robustness). This testing and retraining process shares similarities with ARAL as defined in this paper. Indeed, both processes aim to improve the quality of DNNs through the selection of the most appropriate data. The difference is that active learning applies during training and targets the training data, whereas DL testing applies after training and on a set of data not used during training. Therefore, the retraining process can be seen as one-stage active learning performed after training on a new set of data. The acquisition functions used in active learning—including the one proposed in this paper—can thus be used for retraining.

3 ARAL: adversarial-robust active learning

Consider a C-class classification problem over a sample space \({\mathcal {Z}}={\mathcal {X}}\times {\mathcal {Y}}\rightarrow {\mathcal {R}}\), and a collection of data \({\mathcal {D}}=\left\{ x_i,y_i\right\} _{i\in [n]}\sim p_{{\mathcal {Z}}}\), where C, \(x_i\in {\mathcal {X}}\), and \(y_i\in {\mathcal {Y}}\) are the number of classes, the data, and the label, respectively. \(p_{{\mathcal {Z}}}\) is the data distribution of \({\mathcal {D}}\). Let \({\mathcal {H}}\) be the hypothesis space and \(f_\theta\) be a neural network architecture parameterized by a parameter vector \(\theta \in {\mathcal {H}}\). The loss function is \(J:{\mathcal {H}}\times {\mathcal {Z}}\rightarrow {\mathcal {R}}\). In deep learning, the objective of training \(f_\theta\) is to minimize the expected loss (also called the expected risk):

$$\begin{aligned} {\mathbb {E}}_{x,y\sim p_{{\mathcal {Z}}}}\left[ J\left( \theta _{{\mathcal {D}}},x,y\right) \right] \end{aligned}$$
(1)

where \(\theta _{{\mathcal {D}}}\) indicates that the parameters are tuned using \({\mathcal {D}}\). Note that the value of \(\theta\) depends on the data used for training; to make this explicit, we use the notation \(\theta _{{\mathcal {D}}}\) to indicate that this parameter vector corresponds to the data \({\mathcal {D}}\).

Figure 1 illustrates our proposed ARAL process and its difference from standard active learning. The data are initially unlabeled and stored in an unlabeled pool, \({\mathcal {U}}=\left\{ x_i\right\} _{i\in \left[ n\right] }\sim p_{{\mathcal {Z}}}\). Given a labeling budget b, an acquisition function calculates the priority of each data point to be labeled and the process requests some oracle to label the highest-priority data. These newly labeled data are then moved to a labeled pool \({\mathcal {L}}=\left\{ x_i,y_i\right\} _{i\in \left[ m\right] }\sim p_{{\mathcal {Z}}}\) (\(m\le b\)). In standard active learning, at each stage, the set of model parameters \(\theta\) is updated using all data from the labeled pool—so as to minimize the average model loss on these labeled data, that is,

$$\begin{aligned} {\mathbb {E}}_{x,y\sim p_{{\mathcal {Z}}}}\left[ J\left( \theta _{{\mathcal {L}}},x,y\right) \right] \end{aligned}$$
(2)
Fig. 1

Overview of pool-based active learning

Just as DNNs trained on clean data are well known to be vulnerable to adversarial examples [12, 36,37,38], DNNs trained with standard active learning are expected to have low robustness as well. To confirm this, we conducted experiments on six subjects to measure the robustness of models resulting from standard active learning with different acquisition functions (following the experimental protocol described in Sect. 4). Table 1 summarizes the results. In all cases, the achieved robustness is extremely low (below 0.20%), while the highest accuracy approaches 100%.

Table 1 Standard active learning: result of accuracy (Acc, %) and adversarial robustness (Rob, %) against the Auto attack by 11 acquisition functions

To make the trained DNN resilient to adversarial examples, we propose to incorporate adversarial training into active learning. Compared to standard active learning, for each labeled data point we craft an adversarial example—using the PGD attack [14]—and train the DNN on the produced examples (instead of the original labeled data). Hence, the clean labeled data \(\{(x_i, y_i)\}\) are replaced by perturbed examples \({\mathcal {L}}_{adv}=\left\{ x_i+\delta ,y_i\right\} _{i\in \left[ m\right] }\) (\(m\le b\)) during training, and the training objective becomes to minimize the risk with respect to the adversarial examples:

$$\begin{aligned} {\mathbb {E}}_{x,y\sim p_{{\mathcal {Z}}}}\left[ J\left( \theta _{{\mathcal {L}}_{adv}},x+\delta ,y\right) \right] \end{aligned}$$
(3)

where \(\delta\) denotes the perturbation, whose magnitude is bounded. Note that in adversarial training, the produced perturbed data are used even if the DNN correctly classifies them (i.e. even if the adversarial attack “fails”).
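
As an illustration only, a minimal sketch of the ARAL loop could look as follows; `acquisition_fn`, `label_oracle`, and the `pgd_perturb` helper from the adversarial-training sketch in Sect. 2.2.3 are assumed placeholders, and the batching and bookkeeping are deliberately simplified.

```python
import torch
import torch.nn.functional as F

def aral(model, optimizer, x_unlabeled, acquisition_fn, label_oracle,
         budget_per_stage, n_stages, epochs_per_stage, batch_size=128):
    """Simplified ARAL loop: select data, query labels, adversarially train on their PGD counterparts."""
    x_lab = x_unlabeled[:0]                               # empty labeled pool with matching shape
    y_lab = torch.empty(0, dtype=torch.long)
    for _ in range(n_stages):
        scores = acquisition_fn(model, x_unlabeled)       # priority of each unlabeled point
        idx = torch.topk(scores, budget_per_stage).indices
        x_sel, y_sel = x_unlabeled[idx], label_oracle(x_unlabeled[idx])   # oracle provides labels
        keep = torch.ones(len(x_unlabeled), dtype=torch.bool)
        keep[idx] = False
        x_unlabeled = x_unlabeled[keep]                   # move selected data out of the pool
        x_lab, y_lab = torch.cat([x_lab, x_sel]), torch.cat([y_lab, y_sel])
        for _ in range(epochs_per_stage):                 # adversarial training on the labeled pool
            perm = torch.randperm(len(x_lab))
            for i in range(0, len(x_lab), batch_size):
                xb, yb = x_lab[perm[i:i + batch_size]], y_lab[perm[i:i + batch_size]]
                x_adv = pgd_perturb(model, xb, yb)        # train on adversarial counterparts (Eq. 3)
                optimizer.zero_grad()
                F.cross_entropy(model(x_adv), yb).backward()
                optimizer.step()
    return model
```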

The main research question that we study in this paper is whether, given only a few labeled data, ARAL can produce robust models. We, therefore, conduct a comprehensive empirical study involving a large set of acquisition functions.

4 Experimental design

4.1 Implementation

All experiments were conducted on a high-performance computing cluster consisting of GPU nodes. To reduce the influence of randomness, each experiment is repeated three times; in total, 15105 DNNs are trained in this study. The complete results, which corroborate our findings, are also available on our companion project website [55].

4.2 Datasets and models

Four popular publicly available datasets, modified national institute of standards and technology (MNIST) [56], Fashion-MNIST [57], SVHN [58], and Canadian institute for advanced research (CIFAR)-10 [59], are selected for evaluation. These datasets have previously been used for robustness assessment in the context of DL testing [3]. Both MNIST and Fashion-MNIST include a collection of 28 x 28 grayscale images. MNIST comprises handwritten digits \(0\sim 9\) and Fashion-MNIST presents fashion products. SVHN and CIFAR-10 consist of 32 x 32 RGB images. SVHN shows street view house numbers and CIFAR-10 contains more complex entities, such as vehicles and animals. Table 2 summarizes the detailed information of the datasets and DNN models. In all our experiments, we use the training set to craft the adversarial examples during adversarial training. The test data are evenly split into validation and test sets. The validation set is used during training to tune the parameters of the DNNs, and the test set is independent of training and used for evaluating the accuracy (directly on the test data) and the robustness (by crafting adversarial examples from the test data).

Table 2 Summary of datasets and DNNs

4.3 ARAL process and test selection

Table 3 summarizes the active learning setups. Concretely, we follow [60] to initialize the labeled pool with a small amount of data uniformly sampled from the unlabeled pool. The initial DNN is tuned using these initial sets via standard training for all the acquisition functions. In test selection for model retraining, 20 epochs are used for SVHN and 10 for the others to ensure the training process converges.

Table 3 Configuration setting of ARAL

4.4 Evaluation measures

We measure the accuracy and adversarial robustness of each DNN. Given a test set and a DNN, the accuracy is the percentage of correctly classified data over the entire test set. Adversarial robustness is the percentage of correctly classified adversarial examples over the entire adversarial test set. The adversarial test set consists of adversarial examples crafted from the clean test set by a specific adversarial attack. Table 4 lists the parameters of the adversarial attacks for ARAL. Both white-box and black-box attacks are employed for robustness evaluation. The white-box PGD attack is commonly used to evaluate DNNs’ robustness. Since in this case the defended DNN knows the attack in advance, the SOTA black-box attack named square attack [39] is also applied. Finally, we use the Auto attack [42], an adaptive attack that is widely utilized as a strong baseline thanks to its ability to overcome common defense mechanisms such as gradient masking [61]. All these attacks are implemented using a public library, Torchattacks [62]. The library’s default settings are used for all other attack parameters.
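
For illustration, robustness evaluation with Torchattacks can be sketched as below; the attack parameters shown are illustrative defaults rather than the exact values of Table 4, and `model`, `x_test`, and `y_test` are assumed to be already available.

```python
import torch
import torchattacks

def robustness(model, x_test, y_test, attack):
    """Robustness = accuracy on adversarial examples crafted from the clean test set."""
    model.eval()
    x_adv = attack(x_test, y_test)                # the attack crafts the adversarial test set
    with torch.no_grad():
        pred = model(x_adv).argmax(dim=1)
    return (pred == y_test).float().mean().item()

# Illustrative configurations; see Table 4 for the parameters actually used
attacks = {
    "PGD":    torchattacks.PGD(model, eps=8/255, alpha=2/255, steps=10),
    "Square": torchattacks.Square(model, eps=8/255),
    "Auto":   torchattacks.AutoAttack(model, eps=8/255),
}
for name, atk in attacks.items():
    print(name, robustness(model, x_test, y_test, atk))
```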

Table 4 Configurations of adversarial training and evaluation

4.5 Acquisition functions

Given a DNN \(f_\theta\), where \(\theta\) is randomly initialized, and an unlabeled pool \({\mathcal {U}}\), an acquisition function helps the active learning system in each step to query the most informative and useful data to solve the optimization objective in Eq. (2). Various acquisition functions have been proposed for data selection. This section reviews 11 widely used acquisition functions.

As Wald showed [63], one way to solve the optimization objective in Eq. (2) is to minimize the maximum of the risk. Based on this, several acquisition functions have been proposed to select the most informative data by assigning an importance score to each data point and ranking accordingly. Let \(\left\{ p\left( y=i\mid x;\theta \right) \right\} _{i\in [C]}\) be the softmax output of \(f_\theta\) (recall that C is the number of classes).

4.5.1 MaxEntropy

In information theory, the entropy (also known as Shannon entropy) quantifies the uncertainty of prediction [64]:

$$\begin{aligned} H\left( y\mid x;\theta \right) =-\sum _{i\in \left[ C\right] }p\left( y=i\mid x;\theta \right) \log \left( p\left( y=i\mid x;\theta \right) \right) \end{aligned}$$
(4)

MaxEntropy ranks the uncertainty of data based on the predictive entropy and selects the most uncertain ones.
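
A minimal PyTorch sketch of this acquisition function (function and variable names are ours, not from the original implementation):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def max_entropy_select(model, x_pool, n_query):
    """Select the n_query unlabeled points with the highest predictive entropy (Eq. 4)."""
    model.eval()
    probs = F.softmax(model(x_pool), dim=1)                    # p(y=i | x; theta)
    entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=1)   # Shannon entropy per point
    return torch.topk(entropy, n_query).indices                # most uncertain data first
```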

4.5.2 DeepGini\(^*\)

The Gini impurity is a loss metric in decision tree learning to decide the optimal split from a root node and subsequent splits. It measures the likelihood of misclassification of a new instance, which reflects the uncertainty of a model to this instance. Borrowing the idea of Gini impurity to deep learning, DeepGini [51] is proposed to select the most informative data:

$$\begin{aligned} \underset{\mathbf{{x}}\in {{\mathcal {U}}}}{{{\,\mathrm{arg\,max}\,}}}\left( 1-\sum _{i\in \left[ C\right] }p^{2}\left( y=i\mid x;\theta \right) \right) \end{aligned}$$
(5)

Similar to MaxEntropy, DeepGini utilizes the output of \(f_\theta\) to assign the informativeness to data. However, as mentioned by the authors of DeepGini, computing the quadratic sum is easier and simpler than performing the logarithmic computation, which is supposed to give better performance.
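
A corresponding sketch for DeepGini, under the same assumptions as the MaxEntropy example above:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def deepgini_select(model, x_pool, n_query):
    """Select data with the largest Gini impurity 1 - sum_i p_i^2 (Eq. 5)."""
    model.eval()
    probs = F.softmax(model(x_pool), dim=1)
    gini = 1.0 - (probs ** 2).sum(dim=1)      # quadratic sum instead of logarithms
    return torch.topk(gini, n_query).indices
```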

4.5.3 BALD

Building on the concept of Bayesian neural networks, where a DNN is defined by a set of parameters drawn from a posterior distribution, Bayesian active learning by disagreement (BALD) [65] seeks the data for which the parameters under the posterior disagree the most about the prediction.

$$\begin{aligned} \underset{\mathbf{{x}}\in {{\mathcal {U}}}}{{{\,\mathrm{arg\,max}\,}}}\left( H\left( y\mid x;{\mathcal {U}}\right) -{\mathbb {E}}_{\theta \sim p_{\theta ,{\mathcal {U}}}}\left[ H\left( y\mid x;\theta \right) \right] \right) \end{aligned}$$
(6)

In other words, BALD identifies the data for which the DNN is, on average, most uncertain about the prediction (high \(H\left( y\mid x;{\mathcal {U}}\right)\)) but individual model parameters are confident (low \({\mathbb {E}}_{\theta \sim p_{\theta ,{\mathcal {U}}}}\left[ H\left( y\mid x;\theta \right) \right]\)). In practice, Monte Carlo dropout is widely applied to approximate the inference due to its great capacity and low cost. Following [4], we set \(T=10\) with dropout probability 0.1 at test time to sample different DNNs.
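
A rough Monte Carlo dropout sketch of BALD is shown below; it assumes the network contains dropout layers and ignores possible side effects of `model.train()` on batch normalization, so it is an approximation of the procedure rather than a reference implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def bald_select(model, x_pool, n_query, T=10):
    """BALD via Monte Carlo dropout: mutual information between predictions and parameters (Eq. 6)."""
    model.train()  # keeps dropout active at test time (T stochastic forward passes)
    probs = torch.stack([F.softmax(model(x_pool), dim=1) for _ in range(T)])   # shape (T, N, C)
    mean_p = probs.mean(dim=0)
    h_of_mean = -(mean_p * torch.log(mean_p + 1e-12)).sum(dim=1)               # entropy of mean prediction
    mean_of_h = -(probs * torch.log(probs + 1e-12)).sum(dim=2).mean(dim=0)     # mean entropy over samples
    return torch.topk(h_of_mean - mean_of_h, n_query).indices
```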

4.5.4 DropOut-Entropy

Instead of computing the entropy over one DNN with fixed parameters, this acquisition function [66] calculates the uncertainty over multiple Bayesian DNNs inferred by Monte Carlo dropout. Thus, the objective changes to finding the data that maximize the following uncertainty:

$$\begin{aligned} H\left( y\mid x;{\mathcal {H}}\right) =-\sum _{i\in \left[ C\right] }p\left( y=i\mid x;{\mathcal {H}}\right) \log p\left( y=i\mid x;{\mathcal {H}}\right) \end{aligned}$$
(7)

where \(p\left( y=i\mid x;{\mathcal {H}}\right) =\frac{1}{T}\sum _{t\in \left[ T\right] }p\left( y=i\mid x;\theta ^t\right)\) is the average predicted output over T applications of dropout to the model, and \(\theta ^t\) denotes the parameters of the t-th dropout. Following [4], we set \(T=10\) with dropout probability 0.1 at test time to sample different DNNs.

4.5.5 LC

Least confidence (LC) uses the most probable label of an instance to capture the uncertainty. It ranks data based on the highest posterior probability and selects the data with the least confidence in the prediction:

$$\begin{aligned} \underset{\mathbf{{x}}\in {{\mathcal {U}}}}{{{\,\mathrm{arg\,max}\,}}}\left( 1-p\left( y=y'\mid x;\theta \right) \right) \end{aligned}$$
(8)

where \(y'=\underset{i\in \left[ C\right] }{{{\,\mathrm{arg\,max}\,}}}\left( p\left( y=i\mid x;\theta \right) \right)\) is the class label predicted by \(f_\theta\).

4.5.6 Margin

Since LC only utilizes the most confident class label, the information about the remaining labels is discarded although it can also be useful. To address this issue, margin sampling [67] ranks data based on the difference between the probabilities of the most confident and second most confident labels and chooses the data with a small difference:

$$\begin{aligned} \underset{\mathbf{{x}}\in {{\mathcal {U}}}}{{{\,\mathrm{arg\,min}\,}}}\left( \underset{i\in \left[ C\right] }{\max }\left( p\left( y=i\mid x;\theta \right) \right) -\underset{i\in \left[ C\right] \setminus \left\{ y'\right\} }{\max }\left( p\left( y=i\mid x;\theta \right) \right) \right) \end{aligned}$$
(9)

The hypothesis is that if a DNN predicts a similar probability on the top two labels for an instance, then this instance is not well learned and remains close to the decision boundary.
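
A minimal sketch of margin sampling (again with illustrative names only):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def margin_select(model, x_pool, n_query):
    """Select data whose top-two predicted probabilities are closest (Eq. 9)."""
    model.eval()
    probs = F.softmax(model(x_pool), dim=1)
    top2 = torch.topk(probs, k=2, dim=1).values
    margin = top2[:, 0] - top2[:, 1]          # small margin = close to the decision boundary
    return torch.topk(margin, n_query, largest=False).indices
```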

4.5.7 MCP\(^*\)

Noticing that the data selected by margin sampling might be unevenly distributed across decision boundary areas, Multiple-Boundary Clustering and Prioritization (MCP) [54] improves margin sampling by uniformly selecting data from different areas. Besides, it computes the priority using the ratio of the probabilities of the top two labels instead of their difference. Each decision boundary area is a cluster defined by the top two classes. Data with high priorities are selected from each cluster.

4.5.8 DFAL

To approximate how close an instance is to the decision boundary, DeepFool active learning (DFAL) [4] utilizes the magnitude of the minimum perturbation required to successfully craft an adversarial example of this instance with the DeepFool attack. Indeed, adversarial attacks are designed to push clean data across the decision boundary. The objective of DFAL is

$$\begin{aligned} \underset{\mathbf{{x}}\in {{\mathcal {U}}}}{{{\,\mathrm{arg\,min}\,}}}\left( DeepFool\left( x,f_\theta ,L_p\right) \right) \end{aligned}$$
(10)

where \(DeepFool\left( x,f_\theta ,L_p\right)\) is a function that attacks \(f_\theta\) given x using the \(L_p\) norm (\(p=2\)) and outputs the perturbation between x and its adversarial example.

4.5.9 EGL

Given that DNN training generally uses gradient-based optimization, the expected gradient length (EGL) [9] ranks an instance as highly important if it would induce the greatest change in the gradient. A challenge is that in active learning the true labels are not available; thus, EGL considers all possible labels for each data point and computes the expected gradient. The data are selected by

$$\begin{aligned} \underset{\mathbf{{x}}\in {{\mathcal {U}}}}{{{\,\mathrm{arg\,max}\,}}}\left( \sum _{i\in \left[ C\right] }p\left( y=i\mid x;\theta \right) \parallel \triangledown J\left( \theta ,x,y=i\right) \parallel \right) \end{aligned}$$
(11)

where \(\parallel \cdot \parallel\) is the \(L_2\) norm (Euclidean distance) and \(\triangledown J\left( \theta ,x,y=i\right)\) denotes the gradient of the loss function at x given \(\theta\) and \(y=i\).
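
A naive (and deliberately unoptimized) sketch of the EGL score computation; in practice these computations would be batched, and the helper below only illustrates Eq. (11):

```python
import torch
import torch.nn.functional as F

def egl_scores(model, x_pool):
    """Expected gradient length: expected L2 norm of the loss gradient over all candidate labels (Eq. 11)."""
    model.eval()
    scores = []
    for x in x_pool:
        x = x.unsqueeze(0)
        with torch.no_grad():
            probs = F.softmax(model(x), dim=1).squeeze(0)   # p(y=i | x; theta)
        expected = 0.0
        for i in range(probs.numel()):                      # assume each possible label in turn
            model.zero_grad()
            F.cross_entropy(model(x), torch.tensor([i], device=x.device)).backward()
            grad_sq = sum((p.grad ** 2).sum() for p in model.parameters() if p.grad is not None)
            expected += probs[i].item() * grad_sq.sqrt().item()
        scores.append(expected)
    return torch.tensor(scores)   # select with torch.topk(scores, n_query)
```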

Instead of minimizing the maximum of the risk, some acquisition functions, such as Core-set, address the optimization objective in Eq. (2) by adding data that are far from the labeled data.

4.5.10 Core-set

The Core-set selection [8] converts the optimization objective in Eq. (2) into the k-center problem by setting an upper bound. It solves the k-center problem by selecting the data that are farthest from the data in the labeled pool \({\mathcal {L}}\). Given that the data selection is time-consuming, k-Center-Greedy is utilized instead of the robust k-Center since they exhibit similar behavior [8].
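
A sketch of the k-Center-Greedy step on top of some feature representation (e.g. penultimate-layer activations); the feature extraction itself is omitted and the names are illustrative:

```python
import torch

@torch.no_grad()
def k_center_greedy(feat_unlabeled, feat_labeled, n_query):
    """Greedily add the unlabeled point farthest from the current set of centers."""
    dists = torch.cdist(feat_unlabeled, feat_labeled).min(dim=1).values   # distance to nearest center
    selected = []
    for _ in range(n_query):
        idx = int(torch.argmax(dists))                  # farthest remaining point
        selected.append(idx)
        new_center = feat_unlabeled[idx].unsqueeze(0)
        dists = torch.minimum(dists, torch.cdist(feat_unlabeled, new_center).squeeze(1))
    return selected
```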

4.5.11 Random

Random sampling is the simplest, model-free method, in which each data point has an equal probability of being selected. Strictly speaking, this method belongs to passive learning, but it is commonly taken as a baseline in active learning.

Note that the functions marked with \(*\) were originally designed for test selection, but they also select the most informative data and are thus suitable for active learning.

5 Preliminary results of the empirical study

5.1 Effectiveness of ARAL

We investigate the effectiveness of ARAL in creating models that are both accurate and robust.

Fig. 2

Test accuracy (%) of 11 acquisition functions over different stages of ARAL. Baseline is the model adversarially trained using the entire data

Fig. 3

Robustness (%) against Auto attack of 11 acquisition functions over different stages of ARAL. Baseline is the model adversarially trained using the entire data

Results Figures 2 and 3 visualize the accuracy and robustness curves of the 11 acquisition functions, respectively, for each dataset and model architecture. For brevity, the results of CIFAR-10 are plotted every 5 stages. First, we observe that most of the acquisition functions retain the ability to achieve the same level of test accuracy as the model adversarially trained with the full dataset. In general, 3%, 2%, 14%, and 22% of the labeled data are enough to reach this level of accuracy for MNIST, Fashion-MNIST, SVHN, and CIFAR-10, respectively. However, some functions (e.g. MaxEntropy, Entropy, DeepGini, LC, and EGL) perform worse than the others, including random sampling. MCP, Core-set, BALD, and (occasionally) DFAL are the best performing metrics and outperform random sampling.

Second, all acquisition functions allow a substantial increase in robustness compared to models trained on original data only. However, the obtained models are less robust than the model adversarially trained with all data (between 2% and 10% robustness difference compared to the best performing acquisition function). Interestingly, random sampling stands out on all the datasets and DNNs, often by a significant margin. Just like for accuracy, MaxEntropy, Entropy, DeepGini, LC, and EGL achieve the lowest levels of robustness. The other acquisition functions perform inconsistently across datasets and DNNs, yet remain inferior to random sampling. This indicates that the criteria used by existing acquisition functions have a negative effect on robustness, to the extent that random sampling—a simple, random, model-free, and data-free method—performs much better.

To confirm this conclusion, we conduct a statistical analysis and assess whether the difference in robustness between random sampling and the other acquisition functions is statistically significant. The analysis is based on the Wilcoxon signed-rank test [68], a nonparametric statistical hypothesis test commonly used for comparing two paired samples. In our case, each sample is the robustness obtained with a given acquisition function. Random sampling is compared with each of the other 10 functions on all datasets/models and three attacks. This yields 180 (\(10 \times 6 \times 3\)) statistical tests. The significance level \(\alpha\) is set to \(5.00E-02\). All these tests reject the null hypothesis that there is no difference between random sampling and the other acquisition function, with p-values ranging from 8.15E-10 to 9.49E-03. Hence, we conclude that random sampling significantly outperforms the other functions in terms of robustness.
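
For illustration, such a paired test can be run with SciPy as sketched below; the robustness values are made-up placeholders, not results from our study.

```python
from scipy.stats import wilcoxon

# Hypothetical paired robustness values (%) of random sampling vs. another acquisition
# function on the same dataset/model/stage combinations (placeholders only).
rob_random = [42.1, 38.7, 55.3, 40.2, 47.9, 51.0]
rob_other  = [39.4, 35.1, 52.8, 37.5, 45.2, 48.3]

stat, p_value = wilcoxon(rob_random, rob_other)
print(f"p-value = {p_value:.2e}")   # reject H0 (no difference) if p-value < 5.00E-02
```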

Conclusions Using only a limited set of labeled data, ARAL can produce models as accurate as the model adversarially trained with all data. When it comes to robustness, random sampling performs consistently better than the other acquisition functions, though there remains a substantial margin for improvement compared to using all data.

5.2 Data selection biases

As an attempt to explain the better effectiveness of random sampling and to devise an improved acquisition function, we investigate the characteristics of the data selected by each acquisition function at all stages of the active learning process. We hypothesize that random sampling performs better than existing acquisition functions because the latter are biased towards the “most” informative data—informative being defined by each function in terms of the metric it uses to prioritize the data. By contrast, random sampling is naturally unbiased as it gives each data point the same probability of being selected. Therefore, on average, it selects a set of data that is representative of the whole unlabeled pool.

We aim at establishing a relationship between the biases that the acquisition functions introduce and their inability to produce robust models. From the corresponding acquisition functions, we study four data characteristics: entropy, Gini impurity, least confidence (lc), and margin. The characteristics of the other functions are not applicable due to the use of dropout (BALD, DropOut-Entropy) or model independence (MCP, DFAL, EGL, Core-set). Additionally, we measure the bias in the true class label, which is readily available and has been studied in the literature [24].

First, we measure the bias of each acquisition function towards each of these characteristics. Given a dataset and a DNN, let \({\mathcal {U}}_{i,j}\) be the unlabeled pool in the i-th stage of active learning using the j-th acquisition function and \({\mathcal {S}}_{i,j} \subseteq {\mathcal {U}}_{i,j}\) be the set of data selected by j at this i-th stage. Given a characteristic function \(\phi\) that, for an input data point x, returns the value of the characteristic under study for x, we generate two sets of variables \(\phi \left( {\mathcal {S}}_{i,j}\right)\) and \(\phi \left( {\mathcal {U}}_{i,j}\right)\) that contain the characteristic value of all data in \({\mathcal {S}}_{i,j}\) and \({\mathcal {U}}_{i,j}\), respectively. We then estimate the probability density functions (PDFs) of \(\phi \left( {\mathcal {S}}_{i,j}\right)\) and \(\phi \left( {\mathcal {U}}_{i,j}\right)\) using the histogram method with 50 bins [69], yielding \(PDF_{\phi \left( {\mathcal {S}}_{i,j}\right) }\) and \(PDF_{\phi \left( {\mathcal {U}}_{i,j}\right) }\), respectively. We calculate the difference between the two PDFs using the Jensen-Shannon divergence (JSD)—an established method to measure the divergence between two distributions [70]. This difference is, therefore, given by

$$\begin{aligned} d_{i,j}=\mathrm {JSD}\left( PDF_{\phi \left( {\mathcal {S}}_{i,j}\right) }\parallel PDF_{\phi \left( {\mathcal {U}}_{i,j}\right) }\right) \end{aligned}$$
(12)

Second, we measure the correlation, at each stage and for every characteristic, between (a) the robustness of models produced by the different acquisition functions and (b) the characteristic bias of the data selected by the corresponding functions. We use the Pearson correlation coefficient, which captures linear relationships between two variables. Hence, for each stage i and each characteristic, a correlation coefficient \(r_i\) is calculated by [71]:

$$\begin{aligned} r_i=\mathrm {Pearson}\left( d_i, a_i\right) , d_i=\left\{ d_{i,j}\right\} _{j\le 11}, a_i=\left\{ a_{i,j}\right\} _{j\le 11} \end{aligned}$$
(13)

where \(a_{i,j}\) denotes the robustness of the model produced at the i-th stage by the j-th acquisition function.
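
The two measurements can be sketched with NumPy/SciPy as follows; the `d_i` and `a_i` arrays below are random placeholders standing in for the measured bias and robustness values, and note that SciPy's `jensenshannon` returns the distance (the square root of the divergence).

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import pearsonr

def bias(phi_selected, phi_unlabeled, bins=50):
    """Jensen-Shannon divergence between the characteristic distributions (Eq. 12)."""
    lo = min(phi_selected.min(), phi_unlabeled.min())
    hi = max(phi_selected.max(), phi_unlabeled.max())
    p, _ = np.histogram(phi_selected,  bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(phi_unlabeled, bins=bins, range=(lo, hi), density=True)
    return jensenshannon(p, q) ** 2        # square the JS distance to obtain the divergence

# Correlation at stage i over the 11 acquisition functions (Eq. 13); placeholder values.
d_i = np.random.rand(11)    # bias d_{i,j} of each acquisition function
a_i = np.random.rand(11)    # robustness a_{i,j} of the corresponding models
r_i, _ = pearsonr(d_i, a_i)
```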

Results Figure 4 shows a heat map of the Pearson correlation coefficients \(\left\{ r_i\right\}\) obtained from SVHN/VGG8 and the Auto attack. All correlations are negative (the colors are always red), which supports our hypothesis that characteristic biases have a negative effect on robustness. In other words, selecting data that are more representative of the entire unlabeled pool is more likely to produce a robust DNN. Among all the characteristics, entropy exhibits the strongest negative correlations. Over the 18 stages, the strongest and average correlations obtained by entropy are -0.84 and -0.65 (strong negative correlations), whereas the weakest correlations are achieved by margin (at most -0.76 and -0.56 on average). Entropy, therefore, seems to be the most capable metric to drive the selection of a representative set of data that will produce robust models.

Fig. 4

Heat map of the Pearson correlation between the bias of characteristics and adversarial robustness against the Auto attack. x-axis: stages in active learning; y-axis: characteristics. The size of each square corresponds to the magnitude of the correlation it represents. Dataset: SVHN; DNN: VGG8

Conclusions The inherent bias of acquisition functions towards the “most informative” data has strong negative correlations with robustness. To improve robustness, an ideal function should rather select data that have a representative level of informativeness. Among the informative characteristics, entropy bias has the strongest negative correlations.

6 Density-based robust sampling with entropy

6.1 Definition and algorithm

Inspired by our previous results, we propose a new acquisition function for ARAL: the density-based robust sampling with entropy (DRE). The key principle of DRE is to maintain a balance between selected data and unlabeled pool in terms of entropy distribution (represented by entropy PDF as defined in the previous section).

Concretely, in the i-th stage, let \({\mathcal {L}}_i\) and \({\mathcal {U}}_i\) be the labeled pool and unlabeled pool, respectively. A DNN parameterized by \(\theta _{{\mathcal {L}}_i}\) is trained on \({\mathcal {L}}_i\). First, for each data point \(x\in {\mathcal {U}}_i\), we calculate the entropy (by Eq. (4)) of the prediction of this DNN and obtain a set of entropy scores:

$$\begin{aligned} {\mathcal {A}}_i=\left\{ H\left( y\mid x;\theta _{{\mathcal {L}}_i}\right) , x\in {\mathcal {U}}_i\right\} \end{aligned}$$
(14)

Afterwards, we estimate the entropy distribution \(PDF_{{\mathcal {A}}_i}\) of \({\mathcal {U}}_i\) via the histogram method [69]. Correspondingly, each data point is assigned to a density interval (bin) based on its entropy score. Finally, we randomly select data from each density interval and add them to \({\mathcal {L}}_i\). The number of data to be selected from the j-th density interval is determined by:

$$\begin{aligned} n_j=n*PDF_{{\mathcal {A}}_i}\left( j\right) \end{aligned}$$
(15)

where n is the number of data to be labeled in each query (see New in Table 3). In this stage, \(\theta _i\) is then updated by adversarial training on \({\mathcal {L}}_i\) together with the selected data.
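
A minimal sketch of the DRE selection step is given below; the binning, rounding, and tie handling are simplifications of our actual implementation (available on the companion website), and any remainder left by rounding would in practice be topped up with extra random picks.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def dre_select(model, x_unlabeled, n_query, bins=50):
    """DRE sketch: sample from each entropy bin proportionally to the bin's density,
    so the selected set mirrors the entropy distribution of the unlabeled pool."""
    model.eval()
    probs = F.softmax(model(x_unlabeled), dim=1)
    entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=1)            # Eq. (14)
    edges = torch.linspace(entropy.min(), entropy.max(), bins + 1)
    bin_idx = torch.clamp(torch.bucketize(entropy, edges) - 1, 0, bins - 1)
    selected = []
    for j in range(bins):
        members = torch.nonzero(bin_idx == j).squeeze(1)
        if len(members) == 0:
            continue
        n_j = round(n_query * len(members) / len(x_unlabeled))          # Eq. (15)
        perm = members[torch.randperm(len(members))]
        selected.extend(perm[:n_j].tolist())                            # random picks inside the bin
    return selected
```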

6.2 Effectiveness of DRE in active learning

DRE is evaluated using the same datasets, DNNs, and experimental protocol as used previously in Sect. 5.

Fig. 5

Box plots of test accuracy (%) and robustness (%) against Auto attack of 12 acquisition functions in ARAL. The higher, the better. Dataset: SVHN. DNN: VGG8

Table 5 Comparison of accuracy (Acc, %) and robustness (Rob, %) against the Auto attack between DRE and 11 acquisition functions

Results Figure 5 shows the box plots of accuracy and adversarial robustness of each acquisition function on SVHN. Table 5 shows the accuracy and the robustness of the models produced by each acquisition function at one-third (1/3) and at the end (“last stage”) of the ARAL process for all datasets and DNNs. DRE is competitive in terms of accuracy. At the end of the process, our acquisition function achieves the best accuracy for Fashion-MNIST and CIFAR-10/PreActResNet18, is close to the best for MNIST and CIFAR-10/ResNet18, and ranks at about the middle for SVHN and CIFAR-10/VGG16. When it comes to robustness, DRE stands out and outperforms all the other acquisition functions on all models and datasets except CIFAR-10/VGG16, where it yields robustness slightly lower than random sampling (− 0.39%). On the other datasets and models, DRE improves the robustness over random sampling by 0.75% up to 3.84%.

We performed statistical tests to assess the significance of the robustness improvement that DRE brings across all datasets, models, and learning stages. A Wilcoxon signed-rank test rejects, at the significance level of 5.00E-02, the null hypothesis that there is no difference between DRE and any other acquisition function, with p-values ranging from 8.15E-10 to 2.25E-03.

Conclusions DRE succeeds in consistently achieving better robustness than SOTA acquisition functions—and does so while remaining competitive in terms of accuracy. DRE, therefore, forms the baseline for ARAL.

6.3 DRE for test selection

While we previously showed that ARAL can train robust models from scratch, we now study the adjacent problem of DL testing and retraining (DL T&R). DL T&R starts from a trained model and attempts to generate new test data that the model misclassifies. In turn, it uses these data to improve the model (through retraining) [3]. Test generation methods generally proceed in two steps: 1) selecting clean data to start from and 2) generating test data from the clean data. For the second step, one can rely on dedicated methods proposed by previous research [72] or simply on adversarial attacks, as in [51].

The acquisition functions in active learning can also serve for test data selection (the first step mentioned above), combined with adversarial attacks to generate data for retraining. Hence, we investigate the capability of these functions to select data with which the robustness of the model will improve after retraining.

We utilize the same datasets and DNNs as previously. We pre-train the DNNs using standard training on the entire training set. Then, following established experimental protocols [3], we use a given acquisition function to select a budget of test data (1% to 10% of the test set). We then generate adversarial examples from the selected data and retrain the DNNs using both the training data and the generated adversarial examples. Finally, we evaluate both the test accuracy and robustness of the retrained DNNs.

Results Table 6 presents the accuracy and robustness of each model, before and after retraining, against the Auto attack. A first conclusion is that all acquisition functions are applicable as test selection metrics. Compared with the baseline where no test data are added for training, they all manage to improve the robustness against the Auto attack by up to 3.23% for SVHN and the three CIFAR-10 models. However, for MNIST and Fashion-MNIST, only DRE can increase the robustness of retrained DNNs, by up to 2.58%. Under different budget sizes, DRE achieves competitive accuracy (the difference is less than 2.03%) compared with the other 11 acquisition functions. Besides, concerning robustness, DRE outperforms the others in most (116 of 144) cases, by up to 8.21%.

Table 6 Comparison of accuracy (Acc, %) and robustness (Rob, %) against the Auto attack with different budgets (1%, 4%) in test selection

Conclusions Although all acquisition functions are viable for DL testing and retraining, DRE yields higher robustness (by up to 8.21%) than any of the other functions.

7 Threats to validity

The internal threat to the validity of our work mainly comes from the implementation of the compared acquisition functions and evaluation metrics. Regarding the 10 functions (all except random sampling), we searched for the original implementations provided by the corresponding authors, or the implementations by other researchers, and carefully adapted them to our PyTorch-based framework. We follow the suggestions in their implementations or publications for the involved parameters. For the robustness evaluation and statistical analysis, we adopt popular libraries, such as Torchattacks and SciPy.

The external threats to validity are related to the selection of datasets, DNNs, and compared acquisition functions. For the datasets and DNNs, we employ four publicly available datasets and six DNN architectures that are widely studied in active learning experiments. Regarding the compared acquisition functions, we include random sampling and ten well-designed functions from the literature, covering both basic and recently proposed ones. Besides, these ten functions come from both the machine learning and software engineering (e.g. MCP and DeepGini) communities.

The construct threats might come from randomness and the robustness evaluation measures. To reduce the threat of randomness, each experiment is repeated three times and we present the average. To gauge adversarial robustness, we adopt two commonly used attacks (PGD and Auto attack, which are the current standard for robustness evaluation in the literature) and add the black-box, gradient-free square attack. Though we show the results for the Auto attack only, the results for the other attacks (available on our companion website) corroborate our findings. Indeed, as the Auto attack is stronger than both the PGD attack and the square attack, the robustness obtained by an acquisition function is lower when evaluated with the Auto attack; correspondingly, the relative differences under the weaker attacks can be slightly larger. For example, DRE achieves better robustness than random sampling by up to 6.89% against the PGD attack (versus up to 3.84% against the Auto attack). All detailed results are provided on our companion website [55]. Using other adversarial attacks, such as the fast gradient sign method (FGSM) [36] and iterative FGSM (i-FGSM) [73], leads to the same conclusion.

8 Conclusion

We investigated the use of active learning for building robust deep learning systems. We conducted a comprehensive study with 11 existing acquisition functions and 15105 trained DNNs and revealed that, in adversarial-robust active learning (ARAL), random sampling achieves better robustness than existing functions, although some of them outperform it in accuracy. Via our analyses, we demonstrated that the data selected for training a robust model should be representative of the entire set, and we proposed DRE, the first acquisition function dedicated to ARAL. Extensive experiments have demonstrated that DRE achieves better robustness than the 11 acquisition functions while achieving competitive accuracy. Besides, the DL testing and retraining experiments have demonstrated that DRE is suitable for this task and still outperforms the other acquisition functions. We hope that our formulation of the ARAL process, the experimental protocol we put in place, and the baseline we propose (DRE) will together inspire future research on developing robust DL models.

The main limitation of this work is the selection of datasets and DNNs, as we mainly focused on the image domain. In future work, we will investigate DRE in other domains, such as text classification and source code analysis. In addition, DRE is developed based on our experimental observations; we would like to seek theoretical strategies to perform adversarial-robust active learning.

9 Supplementary information

We make all our source code publicly available at https://github.com/testing-cs/robustAL.git. The full results are available at our project site https://sites.google.com/view/robust-al/.