1 Introduction


The most common use case of Machine Learning (ML) is supervised learning, which inherently requires a labeled dataset that demonstrates the desired outcome to the ML model. This initial step of acquiring a labeled dataset can only be accomplished by often rare and costly human domain experts; it cannot be automated, since automation via ML is exactly the task to be learned. For example, the average cost of the common labeling task of reliably segmenting a single image is 6.40 USD (Footnote 1). At the same time, recent advances in the field of Neural Networks (NN), such as Transformer-encoder models Vaswani et al (2017) for Natural Language Processing (NLP) (with BERT Devlin et al (2019) being the most prominent example) or Convolutional Neural Networks (CNN) LeCun et al (2015) for computer vision, have resulted in huge deep NNs that require even more labeled training data. Reducing the amount of labeled data is therefore a primary objective in making ML more applicable to real-world scenarios. The focus of this paper is on the NLP domain and Transformer-encoder NNs, but the proposed methods can be applied without further work to other deep NN models and domains. A concrete real-world use-case scenario our research targets is text classification, e.g. the categorization of legal documents such as regulatory documents to ensure adherence to legal requirements.

Given that deep NNs require a large amount of labeled data, Active Learning (AL) is a popular method to reduce the human effort required to create a labeled dataset by avoiding the labeling of nearly identical, and therefore redundant, samples. AL is an iterative process that decides step by step which samples to label first, based on the existing knowledge in the form of the currently labeled samples. In each AL cycle, a new subset of unlabeled samples is actively selected for labeling by human annotators; thus, the set of labeled samples grows continuously and the quality of the learned ML model gradually improves. Due to the iterative selection, the knowledge contained in the so-far labeled data can be leveraged to select the most promising samples to be labeled next. The goal is to reduce the amount of necessary labeling work while maintaining the same model performance. AL achieves this by preventing the annotation of redundant samples, which the model has already learned to represent properly. The challenge of applying AL is an almost paradoxical problem: how to decide which samples are most beneficial to the ML model without knowing their labels, since this is exactly the task to be learned by the to-be-trained ML model.

Despite successful application in a variety of domains (Gonsior et al, 2020; Gal et al, 2017; Lowell et al, 2019), AL fails to work for very deep NNs such as Transformer-encoder models, rarely beating pure random sampling. The common explanation (Karamcheti et al, 2021; Gleave and Irving, 2022; Sankararaman et al, 2022; D’Arcy and Downey, 2022) is that AL methods favor hard-to-learn samples, often simply called outliers, which negates the potential benefits gained from AL. Another potential explanation – to the best of our knowledge not yet covered in the AL literature – could be the calculation of the uncertainty of the NN. Nearly all AL strategies rely on a method that measures the uncertainty of the to-be-trained ML model in its predictions as probabilities. The reasoning behind this is that the samples with high model uncertainty are the most useful ones to learn from and should therefore be labeled first. For NNs, the softmax activation function is typically used for the last layer, and its output is interpreted as the confidence probability of the NN. But interpreting the softmax function as the true model confidence is a fallacy Pearce et al (2021). We therefore compare eight alternative Uncertainty-measures for AL in an extensive end-to-end evaluation of fine-tuning Transformer models on seven common classification datasets.

Our main contributions are:

  • An empirical comparison of eight alternative Uncertainty-measures to the vanilla softmax function in the context of AL, applied for fine-tuning Transformer-encoder models.

  • A proposal of the novel and easy-to-implement method Uncertainty-Clipping (UC), which mitigates the negative tendency of uncertainty-based AL methods to favor outliers.

  • A systematic evaluation of the Uncertainty-Clipping method demonstrating how it improves nearly all Uncertainty-measures.

Fig. 1 Standard Active Learning Cycle including our proposed Uncertainty-Clipping (UC) to influence the uncertainty-based ranking (using the probability \(P_\theta (y|x)\) of the learner model \(\theta \) in predicting class y for a sample x) by ignoring the top-k results

The remainder of this paper is structured as follows: In Section 2, we briefly explain AL, the Transformer model architecture, and the softmax function. Section 3 presents the alternative Uncertainty-measures, Section 4 describes our experimental setup. We present our results and discussion in Section 5, and conclude in Section 6.

2 Active Learning Basics

This section introduces the standard AL cycle in Section 2.1, and gives an overview of the three categories of AL strategies: uncertainty-based strategies in Section 2.2, diversity-based strategies in Section 2.3 and combined strategies in Section 2.4. We conclude by explaining our reasoning to focus on improving uncertainty-based strategies in this work in Section 2.5.

2.1 Active Learning Cycle

Supervised learning techniques inherently rely on an annotated dataset. AL is a well-known technique for saving human effort by iteratively selecting exactly those unlabeled samples for expert labeling that are the most useful ones for the overall classification task Settles (2012). The goal is to train a classification model \(\theta \) that maps samples \(x \in \mathcal {X}\) to a respective label \(y \in \mathcal {Y}\); for the training, the labels \(\mathcal {Y}\) have to be provided by an oracle, often one or multiple human annotators. Figure 1 shows a standard pool-based AL cycle: Given a small initial labeled dataset \(\mathcal {L}= \{(x_i,y_i)\}_{i=1}^n\) of n samples \(x_i \in \mathcal {X}\) with their respective labels \(y_i \in \mathcal {Y}\) and a large unlabeled pool \(\mathcal {U}= \{x_i\}, x_i \not \in \mathcal {L}\), an ML model called learner \(\theta : \mathcal {X} \mapsto \mathcal {Y}\) is trained on the labeled set. A query strategy \(f:\mathcal {U}\longrightarrow Q\) then chooses a batch of \(b\) unlabeled samples \(Q\), which are labeled by the oracle (human expert) and added to the set of labeled data \(\mathcal {L}\). This AL cycle repeats \(\tau \) times until a stopping criterion is met.
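For illustration, a minimal Python sketch of this pool-based cycle; `train`, `query`, and `oracle` are hypothetical caller-supplied callables standing in for the learner \(\theta \), the query strategy f, and the human annotators, not the API of any particular framework:

```python
def active_learning(train, query, oracle, labeled, unlabeled,
                    batch_size=25, iterations=20):
    """Generic pool-based AL loop (hypothetical callables, see lead-in)."""
    for _ in range(iterations):                       # tau AL cycles
        model = train(labeled)                        # fit learner theta on L
        batch = query(model, unlabeled, batch_size)   # choose Q, a subset of U
        labeled = labeled + [(x, oracle(x)) for x in batch]   # oracle labels Q
        unlabeled = [x for x in unlabeled if x not in batch]  # U shrinks
    return train(labeled)                             # final model
```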

2.2 Uncertainty-based AL

Commonly used AL query strategies that rely on informativeness use the uncertainty of the learner model \(\theta \) to select the AL query. The uncertainty is defined as the inverse of the confidence/probability \(P_{\theta }(y|x)\) of the learner in classifying a sample x with the label y. The idea behind this is to label samples in those regions the learner model is most uncertain about, and thereby intentionally decrease the model’s overall uncertainty. The simplest informativeness-based AL strategy is Uncertainty Least Confidence (LC) Lewis and Gale (1994). This strategy selects those samples the learner model is most uncertain about, i.e. where the probability \(P_{\theta }(\hat{y}|x)\) of the most probable label \(\hat{y}\) is the lowest:

$$\begin{aligned} f_{LC}(\mathcal {U}) =\underset{{x \in \mathcal {U}}}{\arg \max } \left( 1-P_{\theta }(\hat{y}|x)\right) \end{aligned}$$
(1)

A variant of the uncertainty strategy is Uncertainty Max-Margin (MM) Scheffer et al (2001), which selects those samples where the difference between the certainty for the most probable class \(\hat{y}_1\) and the second most probable class \(\hat{y}_2\) is the lowest:

$$\begin{aligned} {{f}_{MM}}(\mathcal {U}) = \underset{{x \in \mathcal {U}}}{\arg \min } \left( P_{\theta }(\hat{y}_1|x)-P_{\theta }(\hat{y}_2|x)\right) \end{aligned}$$
(2)

Another variant is Uncertainty Entropy (Ent) Shannon (1948), where the entropy of the label distribution is used to measure the uncertainty of the learner:

$$\begin{aligned} {f_{Ent}}(\mathcal {U}) = \underset{x \in \mathcal {U}}{\arg \max } \left( - \underset{i}{\sum }\ P_{\theta }(\hat{y_i}|x)\log P_{\theta }(\hat{y_i}|x)\right) \end{aligned}$$
(3)

All uncertainty-based strategies have in common that they need the confidence of the learner model in a quantified form to rank the unlabeled samples from the most certain to the most uncertain. For deep NNs, such as Transformer-encoder-based language models, the softmax activation function is typically used for this purpose.
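For illustration, the three measures above can be computed directly from a matrix of per-class probabilities; a small NumPy sketch, assuming the probabilities come from the learner's softmax output:

```python
import numpy as np

def least_confidence(probs):            # Eq. (1): 1 - P(y_hat|x), larger = more uncertain
    return 1.0 - probs.max(axis=1)

def margin(probs):                      # Eq. (2): MM queries the smallest margin
    top2 = np.sort(probs, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

def entropy(probs, eps=1e-12):          # Eq. (3): larger = more uncertain
    return -(probs * np.log(probs + eps)).sum(axis=1)

probs = np.array([[0.34, 0.33, 0.33],   # ambiguous sample
                  [0.98, 0.01, 0.01]])  # confident sample
print(least_confidence(probs), margin(probs), entropy(probs))
```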

Query-by-committee (QBC) Seung et al (1992), another widely used strategy, uses a committee of multiple learner models and labels first those samples on which the committee members disagree the most. A major drawback of this strategy is the increased runtime, especially when using large NNs as learner models.

Recent uncertainty-based strategies such as Cartography Active Learning (CAL) Zhang and Plank (2021) take the uncertainty of the last n iterations into account, or consider the joint informativeness of a complete batch of queried samples, as in the Bayesian Active Learning by Disagreement (BALD)-based strategies Houlsby et al (2011); Kirsch et al (2019); Siddhant and Lipton (2018).

Fig. 2 Exemplary visualization of a two-dimensional XOR-like dataset

A common problem for uncertainty-based strategies is posed by XOR-like datasets Baram et al (2004); Konyushkova et al (2015), shown exemplarily in Fig. 2 for a two-dimensional dataset. These datasets resemble a chessboard with alternating lines of changing classes. Once the learner model has identified a classification boundary, uncertainty-based strategies will label those samples close to the boundary. The assumption that samples lying far beyond the boundary certainly belong to the same class is especially wrong for XOR-like datasets, due to the chessboard-like alternation between classes. The labeled set may therefore never contain samples from the clusters of the other classes beyond the classification boundary, leaving certain parts of the dataset unexplored.

2.3 Diversity-based Active Learning Strategies

Diversity-based strategies aim to create a labeled set that contains representative samples from the overall sample pool. Core-Set-based strategies such as Coleman et al (2020) and Sener and Savarese (2018) rely on clustering approaches to accomplish this goal. Core-Set Coleman et al (2020) chooses those samples which are closest to the set of already labeled samples. As noted in the survey paper Zhan et al (2021), these strategies perform better when more labeled data is available, as it takes a while to create a representative labeled set. Their strength shows in later AL cycles, or with a larger initially labeled pool. In contrast to uncertainty-based strategies, which label along the decision boundary, diversity-based strategies label across the complete vector space. For specific datasets, such as XOR-like datasets (see e.g. the results in Baram et al (2004); Konyushkova et al (2015)), this proves to be much more effective than uncertainty-based sampling.

2.4 Combined Strategies

Recent AL research focuses on combining uncertainty- and diversity-based approaches. One variant is to learn an optimal AL strategy, like Active Learning by Imitation Learning (ImitAL) Gonsior et al (2022), Active Learning by Learning (ALBL) Hsu and Lin (2015), or Learning Active Learning (LAL) Konyushkova et al (2015, 2018). Another variant is to combine both selection criteria using a manually defined heuristic algorithm, as in Self-paced Active Learning (SPAL) Tang and Huang (2019), QUerying Informative and Representative Examples (QUIRE) Huang et al (2010), or Batch-mode Discriminative and Representative Active Learning (BMDR) Wang and Ye (2015). One commonly neglected aspect of these advanced AL strategies is their very high runtime, which is often especially problematic for large datasets Gonsior et al (2022), even though that is where the greatest potential for AL lies.

2.5 Focus on improving Uncertainty-based AL Strategies

This research paper focuses on improving uncertainty-based AL strategies because they are by far the most widely used and best performing strategies Schröder et al (2022); Yoo and Kweon (2019); Zhan et al (2021), and because the runtime complexity renders most alternative strategies unsuitable for AL with NNs.

The survey by Schröder and Niekler (2020) on AL for text classification emphasizes that the majority of query strategies used in recent NN-based AL are uncertainty-based. Diversity-based strategies often only function well if the underlying sample vector space contains well-formed clusters, and they require a larger amount of labeled data to work well Zhan et al (2021). In addition, their runtime complexity is very high Schröder et al (2022). Widely used mitigation techniques such as subsampling the unlabeled set reduce the effectiveness of AL. The same reason prevents the more advanced combined strategies from being used in practice: their runtime is simply too long to be applicable to large datasets, as they occur in the domain of text classification Gonsior et al (2022).

3 Uncertainty-measures

In the following, we first discuss the suitability and problems of the softmax activation function as an Uncertainty-measure (Section 3.1), present alternative methods in Section 3.2, and end with our novel meta-strategy Uncertainty-Clipping, which significantly enhances several Uncertainty-measures (Section 3.3).

3.1 Softmax as Uncertainty-measure

All uncertainty-based AL strategies have in common that they need the uncertainty of the learner model in a quantified form to rank the unlabeled samples. In this paper, this quantity is called the Uncertainty-measure. The inverse of an Uncertainty-measure, the confidence probability of an ML model, should reflect how probable it is that the model’s own predictions are true. For example, a confidence of \(70\%\) should mean a correct prediction in 70 out of 100 cases. Because its components sum to 1, the softmax function is often used as a makeshift probability measure for the certainty of NNs:

$$\begin{aligned} \sigma (z)_i = \frac{exp(z_i)}{\sum ^K_{j=1}exp(z_j)}, \text { for } i=1,\dots , K \end{aligned}$$
(4)

The output of neuron i in the last layer, before entering the activation function, is called a logit and is denoted as \(z_i\); K denotes the number of neurons in the last layer.

But as other researchers have noted in the past Lakshminarayanan et al (2017); Weiss and Tonella (2022); Gleave and Irving (2022); Sankararaman et al (2022); D’Arcy and Downey (2022), the training objective of NNs is purely to maximize the value of the correct output neuron, not to produce a true confidence probability. An inherent limitation of the softmax function is its inability to express – even in the theoretical case – zero confidence in its prediction, as the sum over all possible outcomes always equals 1. Previous works have indicated that softmax-based confidence is often overconfident Gal and Ghahramani (2016).

Pearce et al (2021); Hein et al (2019) investigate the foundations of the vanilla softmax function as a confidence probability in depth. They show that especially NNs using a typical ReLU activation function for the inner layers can easily be tricked into being overly confident in any prediction by simply scaling the input \(x\) with an arbitrarily large value \(\alpha >1\) to \(\tilde{x} = \alpha x\)  Pearce et al (2021); Hein et al (2019).
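A toy illustration of this effect, scaling the logits directly as a stand-in for scaling the input of a ReLU network (the numbers in the comments are approximate):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())        # subtract max for numerical stability
    return e / e.sum()

logits = np.array([1.2, 0.8, 0.5])
print(softmax(logits))             # ~[0.46, 0.31, 0.23]: rather uncertain
# Scaling the logits (as a proxy for scaling the input of a ReLU network)
# pushes the softmax towards full confidence without any new evidence:
print(softmax(5 * logits))         # ~[0.86, 0.12, 0.03]: overly confident
```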

3.2 Alternative Uncertainty-measures

Even though it is known that the Uncertainty-measure is a crucial component of uncertainty-based AL Lakshminarayanan et al (2017); Blundell et al (2015), especially in the context of deep NNs like Transformer models Schröder et al (2022), little research has been done on comparing Uncertainty-measures in isolation with the goal of AL for Transformer-encoder models in mind. We selected seven methods from the literature that are suitable as alternative Uncertainty-measures for deep NNs such as Transformer-encoder models. They can be divided into five categories Gawlikowski et al (2021): a) single network deterministic methods, which deterministically produce the same result for each NN forward pass (Inhibited Softmax (IS) Możejko et al (2018), TrustScore (TrSc) Jiang et al (2018), and Evidential Neural Networks (Evi) Sensoy et al (2018)), b) Bayesian methods, which sample from a distribution and therefore produce non-deterministic results (Monte-Carlo Dropout (MC) Gal and Ghahramani (2016)), c) ensemble methods, which combine multiple deterministic models into a single decision (Softmax Ensemble Seung et al (1992)), d) calibration methods, which calibrate the softmax function (Label Smoothing (LS) Szegedy et al (2016) and Temperature Scaling (TeSc) Zhang et al (2020)), and e) test-time augmentation methods, which, similarly to the ensemble methods, augment the input samples and return the combined prediction for the augmented samples. The last category is left for future research, as we could not find a subset of data augmentation techniques that reliably worked well for our use case across different datasets.

More elaborate AL strategies like BALD Kirsch et al (2019) or QUIRE Huang et al (2010) not only focus on the confidence-probability measure, but also make use of the vector space to label a diverse training set, including regions far away from the classification boundary. As the focus of this paper is purely on evaluating the influence of the confidence prediction methods, we deliberately use only the most basic AL strategy, Uncertainty Least Confidence.

In the following, the core ideas of the individual methods are briefly explained. More details, reasonings, and the exact formulas can be found in the original papers.

Inhibited Softmax (IS). The Inhibited Softmax method Możejko et al (2018) is a simple extension of the vanilla softmax function by an additional constant factor \(\alpha \in \mathbb {R}\), which enhances the effect of the absolute magnitude of the single logit value \(z_i\) on the softmax output:

$$\begin{aligned} \sigma (z)_i = \frac{exp(z_i)}{\sum ^K_{j=1}exp(z_j)+exp(\alpha )} \end{aligned}$$
(5)

To ensure that the added fraction is not removed during the training process, several changes to the NN have to be made, including: a) removing the bias b from the input of the neuron activation function, and b) extending the loss function by a special evident regularisation term.
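A minimal sketch of the modified denominator from Eq. (5); the accompanying training changes (bias removal, additional regularisation term) are omitted:

```python
import numpy as np

def inhibited_softmax(z, alpha=1.0):
    """Eq. (5): vanilla softmax plus a constant exp(alpha) in the denominator,
    so small absolute logits yield low scores that no longer sum to 1."""
    e = np.exp(z)
    return e / (e.sum() + np.exp(alpha))

print(inhibited_softmax(np.array([2.0, 1.0, 0.5])))  # small logits -> low scores
print(inhibited_softmax(np.array([8.0, 4.0, 2.0])))  # large logits -> close to vanilla softmax
```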

TrustScore (TrSc). The TrustScore Jiang et al (2018) method uses the set of available labeled data to calculate a TrustScore, independently of the NN model. In a first step, the available labeled data is clustered into a single high-density region per class. The TrustScore ts of a sample \(x\) is then calculated as the ratio of the distance from \(x\) to the cluster of the nearest class \(c_{closest}\) to the distance to the cluster of the predicted class \(\hat{y}\):

$$\begin{aligned} ts_x= \frac{dist(x, c_{closest})}{dist(x, c_{\hat{y}})} \end{aligned}$$
(6)

Therefore, the TrustScore is higher when the cluster of the nearest class is further away from the cluster of the most probable class, indicating a potentially wrong classification. The distance metric as well as the calculation of the clusters is based on the k-nearest neighbors algorithm.
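A simplified sketch of Eq. (6) using a nearest-neighbor search, assuming \(c_{closest}\) denotes the nearest class other than the predicted one; the density-based filtering of the original method is omitted:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def trust_scores(X_labeled, y_labeled, X_query, y_pred):
    """Distance to the nearest non-predicted class divided by the distance
    to the predicted class, per query sample (simplified sketch of Eq. (6))."""
    classes = np.unique(y_labeled)
    # distance of every query sample to its closest labeled sample of each class
    dist = {c: NearestNeighbors(n_neighbors=1)
                 .fit(X_labeled[y_labeled == c])
                 .kneighbors(X_query)[0][:, 0]
            for c in classes}
    scores = np.empty(len(X_query))
    for i, c_pred in enumerate(y_pred):
        d_other = min(dist[c][i] for c in classes if c != c_pred)
        scores[i] = d_other / max(dist[c_pred][i], 1e-12)
    return scores
```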

Evidential Neural Networks (Evi). Evidential Neural Networks Sensoy et al (2018) treat the vanilla softmax outputs as a parameter set over a Dirichlet distribution. The prediction acts as evidence supporting the given parameter set out of the distribution, and the confidence-probability of the NN reflects the Dirichlet probability density function over the possible softmax outputs.

Monte-Carlo Dropout (MC). Monte-Carlo Dropout Gal and Ghahramani (2016) is a Bayesian method that uses the NN dropout regularization method to construct an ensemble of the same trained model for “free”. Dropout refers to randomly disabling neurons during the training phase, originally with the aim of reducing overfitting of the to-be-trained network. For Monte-Carlo Dropout, dropout is also applied during the prediction phase. As neurons are disabled randomly, this results in a large sample space of different models, and each model, with differently dropped-out neurons, produces a potentially different prediction. Combining the vanilla softmax predictions using the arithmetic mean produces a combined Uncertainty-measure.
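A minimal PyTorch sketch of the idea, assuming `model(inputs)` returns logits; in practice only the dropout layers should be switched to training mode:

```python
import torch

@torch.no_grad()
def mc_dropout_probs(model, inputs, n_samples=50):
    """Average the softmax outputs of n_samples stochastic forward passes
    with dropout kept active (illustrative sketch, not the small-text API)."""
    model.train()   # enables dropout; freeze batch norm etc. in real code
    probs = torch.stack([torch.softmax(model(inputs), dim=-1)
                         for _ in range(n_samples)])
    return probs.mean(dim=0)   # 1 - max(probs) then serves as the uncertainty
```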

Softmax Ensemble. The softmax ensemble approach uses an ensemble of NN models, similar to Monte-Carlo Dropout. The predictions of the ensemble can be interpreted as votes on the prediction. The disagreement among the voters then acts as the Uncertainty-measure and can be calculated in two ways, either as Vote Entropy (VE) or as Kullback-Leibler Divergence (KLD) McCallum and Nigam (1998):

$$\begin{aligned} VE(x) = -\sum _i\frac{V(\hat{y_i}, x)}{K}\log \frac{V(\hat{y_i}, x)}{K} \end{aligned}$$
(7)

with K being the number of ensemble models, and \(V(\hat{y_i}, x)\) denoting the number of ensemble models assigning the class \(\hat{y_i}\) to the sample \(x\). The complete equation for calculating the KLD is omitted for brevity. Using an ensemble of softmax models inside an AL strategy results in the uncertainty-based Query-by-committee Seung et al (1992) AL strategy.
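A small sketch of Eq. (7), computed from the hard votes of the committee members:

```python
import numpy as np

def vote_entropy(votes, n_classes):
    """Eq. (7): votes has shape (K_models, n_samples) and holds the class
    predicted by each ensemble member for each sample."""
    K = votes.shape[0]
    ve = np.zeros(votes.shape[1])
    for c in range(n_classes):
        frac = (votes == c).sum(axis=0) / K       # V(y_c, x) / K
        nz = frac > 0
        ve[nz] -= frac[nz] * np.log(frac[nz])
    return ve

votes = np.array([[0, 1], [0, 1], [0, 2], [1, 2], [0, 0]])  # 5 models, 2 samples
print(vote_entropy(votes, n_classes=3))   # the second sample is more contested
```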

Fig. 3 Histogram of exemplary uncertainty values for a single AL iteration on the TREC-6 dataset before Uncertainty-Clipping. The x-axis ranges from 0 (full certainty) to 1 (full uncertainty)

Fig. 4 Thresholds used by the different Uncertainty-Clipping methods, shown over a distribution of exemplary uncertainty values for a single AL iteration. The x-axis ranges from 0 (full certainty) to 1 (full uncertainty)

Temperature Scaling (TeSc). Temperature Scaling Zhang et al (2020) is a model calibration method that is applied after training and changes the calculation of the softmax function by introducing a temperature \(T>0\):

$$\begin{aligned} \sigma (z_i) = \frac{exp(z_i/T)}{\sum ^K_{j=1}exp(z_j/T)} \end{aligned}$$
(8)

For \(T=1\) the softmax function remains the original version, for \(T<1\) the softmax output of the largest logit is increased, and for \(T>1\) (the recommended case when using Temperature Scaling) the output of the most probable logit is decreased. This has a dampening effect on the overall confidence. The value of the temperature T is computed empirically using the labeled set of samples available at application time. The parameter is therefore different for each AL iteration.
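A minimal sketch of the temperature search on the currently labeled samples, assuming per-sample logits and integer labels are available; the grid of 1,000 candidate values between 0 and 10 mirrors the setup described in Section 4.1:

```python
import numpy as np

def find_temperature(logits, labels, grid=np.linspace(0.01, 10.0, 1000)):
    """Pick the T that minimizes cross-entropy on the labeled set
    (logits: array of shape (n, K), labels: integer array of shape (n,))."""
    def nll(T):
        z = logits / T
        z = z - z.max(axis=1, keepdims=True)               # numerical stability
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(labels)), labels].mean()
    return min(grid, key=nll)
```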

Label Smoothing (LS). Label Smoothing Szegedy et al (2016) removes a fraction \(\alpha \) of the loss contribution of the predicted class and distributes it uniformly among the other classes by adding \(\frac{\alpha }{K-1}\) to the other loss outputs, with K being the number of classification classes. In contrast to Temperature Scaling, Label Smoothing is not applied after the network has been trained, but directly during the training process via a modified loss function.
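A minimal PyTorch sketch of such a label-smoothed cross-entropy loss, with \(1-\alpha \) on the true class and \(\frac{\alpha }{K-1}\) spread over the remaining classes (\(\alpha = 0.2\) as in our experiments, see Section 4.1):

```python
import torch
import torch.nn.functional as F

def label_smoothing_loss(logits, targets, alpha=0.2):
    """Cross-entropy against smoothed target distributions (sketch)."""
    n_classes = logits.size(-1)
    smooth = torch.full_like(logits, alpha / (n_classes - 1))
    smooth.scatter_(1, targets.unsqueeze(1), 1.0 - alpha)   # 1 - alpha on the true class
    return -(smooth * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
```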

3.3 Uncertainty-Clipping (UC)

Algorithm 1 Top-k Clipping

Algorithm 2 First Peak Clipping

Algorithm 3 First Valley Clipping

The aforementioned Uncertainty-measures can be used directly in AL strategies to sort the pool of unlabeled samples and select exactly those samples for labeling that have the lowest confidence/highest uncertainty. As repeatedly reported by others Karamcheti et al (2021); Gleave and Irving (2022); Sankararaman et al (2022); D’Arcy and Downey (2022), AL is rarely able to outperform pure random sampling when used to fine-tune Transformer-encoder models. Labeling purely based on the uncertainty score results in labeling many outliers/hard-to-learn-from samples, which often leads to bad classification performance.

Figure 3 displays exemplary histograms of the prediction probabilities/uncertainty values of the two methods Label Smoothing and TrustScore for a single AL iteration on a standard NLP dataset, the TREC-6 dataset (Footnote 2). Both distributions have a characteristic small peak of uncertainty to the far right, and we theorize that these are the outliers which are labeled first. Additionally, we display a passive classifier, an NN model trained on the full available training dataset, to further illustrate the uncertainty values in comparison. Such a model is obviously very confident in its predictions due to the much larger amount of labeled information.

We propose three easily implementable methods for improving uncertainty-based AL strategies by preventing potentially harmful outliers from being selected for labeling. An uncertainty-based AL strategy always selects those samples for labeling first where the uncertainty is the highest. The core idea behind Uncertainty-Clipping is to ignore the small peaks to the far right displayed in Fig. 3. These are the most uncertain samples, and supposedly outliers with ambiguous labels. A high uncertainty is therefore expected, but it is not a good indicator for prioritized labeling of these samples, as an ML model that is primarily trained on ambiguous outliers will often need a lot of labeled data to separate the target classes correctly. Ignoring the most uncertain outlier samples results in selecting the second-most uncertain samples for labeling, which in most cases contain almost as much classification-boundary information as the most uncertain samples, and in the case of outliers even more. The challenge is to detect potential outliers without knowing the true labels, while at the same time preventing too many samples from being excluded from selection. The thresholds used for clipping are displayed in Fig. 4.

This idea of discarding some of the most uncertain candidates to improve accuracy has also been noted to work well for Transformer models by Sankararaman et al (2022), who propose a modified, Bayesian-theory-based Monte-Carlo Dropout Uncertainty-measure variant. In Karamcheti et al (2021), the authors also noted the highly negative impact of outliers as hard-to-learn-from samples for Transformer models through an extensive ablation study, and propose developing methods to ignore these outliers as future research. On a similar note, D’Arcy and Downey (2022) find that AL methods, when applied to Transformer models, often perform inconsistently due to labeling too many unlearnable outliers. Their solution to the problem is training multiple models, resulting in an ensemble approach. Our runtime comparison, detailed in Section 5.5, indicates that ensemble approaches should be used with caution due to their very high runtimes compared to all other methods.

Top-k Clipping. The first proposed variant is the simplest method: we use a fixed threshold to ignore everything above the k-th percentile of the most uncertain samples, as outlined in Algorithm 1. First, all Uncertainty-Clipping methods calculate the uncertainty values \({\textbf {u}}\) (lines 1 and 2). Afterwards, the top-k uncertainty values are removed (lines 3 and 4). This method can be combined with any AL strategy that uses uncertainty for ranking the pool of unlabeled samples. A fixed threshold has the advantage of very low implementation overhead, but the disadvantage of being very dependent on the parameter k: too low a value may ignore too few samples, too high a value too many. The following two methods aim to circumvent this restriction.
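A minimal sketch of Top-k Clipping applied to a vector of uncertainty values; k = 0.05 corresponds to the 95th-percentile threshold used in our experiments:

```python
import numpy as np

def topk_clipped_query(uncertainties, batch_size, k=0.05):
    """Drop the top-k fraction of the most uncertain samples and query the
    next-most uncertain ones instead (returns indices into the unlabeled pool)."""
    order = np.argsort(uncertainties)[::-1]     # most uncertain first
    n_clip = int(len(order) * k)                # number of candidates to ignore
    return order[n_clip:n_clip + batch_size]
```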

First Peak Clipping. Other indicators for clipping the uncertainty distribution are local maxima and minima. The goal is to determine the first peak (local maximum) in the uncertainty distribution, which is potentially caused by the outlier samples. Algorithm 2 illustrates our proposed implementation of this idea. First, we calculate a kernel density estimation to obtain a smooth probability density function from the distribution of the uncertainty values (line 4). Based on this, we calculate the first local maximum from the right (line 5), as not all uncertainty distributions have a second peak to the right. Additionally, to prevent filtering out too many samples, we limit this method using the top-k clipping threshold from the first method (lines 6 and 7).

A disadvantage of this and the following method is the dependency on the existence of local maxima, which not every distribution of uncertainty values exhibits.

First Valley Clipping. The idea is very similar to First Peak Clipping, but instead of taking the maximum of the first hill from the right, we use the valley after the first hill as the threshold, as outlined in Algorithm 3.
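A combined sketch of First Peak and First Valley Clipping, smoothing the uncertainty distribution with a kernel density estimate and falling back to the top-k threshold as a safeguard; the SciPy helpers and grid size are illustrative choices:

```python
import numpy as np
from scipy.signal import argrelextrema
from scipy.stats import gaussian_kde

def clipping_threshold(uncertainties, top_k=0.05, valley=False):
    """Return the uncertainty value above which samples are ignored: the first
    local maximum from the right (First Peak) or the first local minimum from
    the right (First Valley), never clipping more than the top-k fraction."""
    grid = np.linspace(uncertainties.min(), uncertainties.max(), 512)
    density = gaussian_kde(uncertainties)(grid)
    comparator = np.less if valley else np.greater
    extrema = argrelextrema(density, comparator)[0]
    threshold = grid[extrema[-1]] if len(extrema) else grid[-1]
    # safeguard: do not filter out more than the top-k fraction of samples
    return max(threshold, np.quantile(uncertainties, 1 - top_k))
```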

4 Experimental Setup

This section provides the details needed to reproduce our evaluation. In support of the reproducibility initiative, we enable other researchers to re-use our work by making our source code fully publicly available on GitHub (Footnote 3).

4.1 Setup

We extended the AL framework small-text Schröder et al (2021), which is tailored to applying AL to Transformer-encoder networks. Some of the aforementioned methods, such as Label Smoothing, can be applied directly during the training of the network and have a positive effect on the training outcome, whereas others, like Monte-Carlo Dropout, are applied after training is complete. As we are only interested in evaluating the effect of the alternative Uncertainty-measures rather than their potentially positive influence on classification quality, we effectively train two Transformer-encoder models simultaneously. One is the original vanilla Transformer-encoder model with a linear classifier and a softmax on top of the CLS embedding; it is used for the class predictions and for calculating the classification quality. The other is solely used for the AL selection process and includes the implemented alternative Uncertainty-measures. Even though this adds computational overhead to our experiments, it enables us to evaluate the effect of the Uncertainty-measure for AL independently of other potentially positive – or negative – effects on the classification quality.

Parameters for the compared methods were selected empirically using hyperparameter tuning. The constant factor \(\alpha \) for Inhibited Softmax was set to 1.0, and for TrustScore the k of the k-nearest neighbor calculation was set to 10. We used an ensemble of 50 softmax-based Transformer-encoder models for Monte-Carlo Dropout, each one with a different seed for the random number generator. For the softmax ensemble, we could only use 5 different Transformer-encoder models with the vanilla softmax activation function, as the runtime was drastically higher compared to Monte-Carlo Dropout (MC). For Label Smoothing, the fraction \(\alpha \) was empirically fixed to 0.2. For Temperature Scaling, 1,000 different values for T between 0 and 10 were tested; the temperature resulting in the smallest cross-entropy was used. Whenever possible, we used the original implementations of the methods with slight adaptations to make them work with Transformer-encoder models and the small-text framework. We use the original \(BERT_{base}\) Devlin et al (2019) model and the updated version \(RoBERTa_{base}\) Liu et al (2019) as Transformer-encoder models.

4.1.1 Active Learning Simulation

To evaluate the effectiveness of the Uncertainty-measures, we simulate AL in the following way: for a given labeled dataset, we start with an initially labeled set of 25 samples and afterwards perform 20 iterations of AL, ignoring the known labels of the remaining samples, following the procedure of Schröder et al (2022). In each iteration, a batch of 25 samples is selected by the AL strategy for labeling, which is simulated using the known ground-truth labels. Each simulation was repeated 10 times using different initially labeled samples to ensure statistical significance.

Table 1 Abbreviations for the evaluated AL strategies using the alternative Uncertainty-measures, including several baselines
Table 2 Information about the datasets used in the experiments

As baselines, we deliberately included only uncertainty-based AL strategies. Firstly, the scope of this work is to evaluate improvements to the uncertainty-calculation part of AL strategies; more advanced strategies have the potential to overshadow the influence of the changed Uncertainty-measures. Secondly, according to recent AL surveys such as Zhan et al (2021); Schröder et al (2022), the vast majority of AL strategies use uncertainty. And thirdly, other works such as Karamcheti et al (2021); Schröder et al (2022) have already evaluated those advanced strategies on nearly the same datasets.

We used the simplest uncertainty-based AL strategy, Uncertainty Least Confidence (see Section 2), which directly uses the Uncertainty-measure without further pre-processing, and replaced the softmax function with the alternative Uncertainty-measures. Two additional uncertainty-based baselines were included: Uncertainty Entropy Shannon (1948) and Uncertainty Max-Margin Scheffer et al (2001). These two further process the uncertainty values before ranking, by either calculating the entropy of the uncertainty or the margin between the most and second-most probable class, but do not make use of any other information. Further baselines were the passively trained model using all available training data and pure random sampling. The used strategies are summarized in Table 1; the abbreviations introduced there are used in the remainder of this paper.

4.2 Datasets, Hardware, and Metrics

Table 2 lists the seven common NLP datasets we used in our experiments to fine-tune the pre-trained Transformer-encoder models in the AL simulations. The datasets were selected as a diverse set of popular NLP datasets, including binary and multi-class classification tasks from different domains of varying difficulty. The datasets were obtained from the Huggingface dataset repository Lhoest et al (2021), and we used the train-test splits provided by Huggingface. All experiments were conducted on a cluster consisting of NVIDIA A100-SXM4 GPUs, AMD EPYC 7352 (2.3 GHz) CPUs, and NVMe disks. Each experiment was run on a single graphics card with 120 GB of memory and 16 CPU cores.

Table 3 Arithmetic mean of relative gains for \(acc_{last5}\) using the Uncertainty-Clipping variants from Section 3.3 in combination with the implemented Uncertainty-measures; displayed values are percentages

The AL experiments can be evaluated in a multitude of ways. At its core, after each AL iteration a standard ML metric is measured on a withheld dedicated test set. We decided on the accuracy (acc) metric, calculated on a withheld test dataset. It is possible to compare the test accuracy of the last iteration, the mean of the last five iterations (\(acc_{last5}\)), or the mean of all iterations; the latter equals the area under the curve when plotting the so-called AL learning curve. As an effective AL strategy should select the most valuable samples for labeling first, metrics that include the accuracy of multiple iterations are often closer to real use cases. Nevertheless, at the beginning of the labeling process the fluctuation of the test accuracy for most strategies is very high and often contains surprisingly little information about which strategy is better: the influence of the initially labeled samples is simply so high that a better strategy with a bad starting point has no chance against a bad strategy with a good starting point. After a couple of AL iterations, however, the results stabilize and good strategies can be reliably distinguished from bad ones, as each strategy tends to approach its own characteristic threshold, regardless of the starting point. Therefore, we use the accuracy of the last five iterations, deliberately ignoring the first iterations with their highly fluctuating results.

Additionally, we measured the runtime. AL, applied in real-life scenarios, is an interactive process; decisions of the AL strategy should be made within the magnitude of single-digit seconds, as longer calculation times render the annotation process impractical.

As each experiment was repeated 10 times, we report in the following the arithmetic mean for the 10 repetitions, separated by dataset.

5 Results

We conducted a series of experiments to compare the alternative Uncertainty-measures (See Section 3):

  • We start by comparing the relative performance gains using the proposed Uncertainty-Clipping variants (Section 5.1).

  • Afterwards, we compare the different Uncertainty-measures, with and without Uncertainty-Clipping, to assess the general influence of Uncertainty-Clipping (Section 5.2).

  • Next, we analyze which AL strategies behave similarly, indicated by samples that have been queried by multiple strategies (Section 5.3).

  • Then, we analyze the class distribution of the queried samples (Section 5.4).

  • Due to AL being a human-in-the-loop method, we include a necessary analysis of the runtimes (Section 5.5).

  • Finally, we conclude with an overview of the limitations of our approach (Section 5.6).

5.1 Uncertainty-Clipping

The first experiment targets our proposed Uncertainty-Clipping variants. The results are displayed in Table 3: the columns contain the results per Uncertainty-measure, the rows the different Uncertainty-Clipping methods. As the metric \(acc_{last5}\) differs per dataset, we do not simply average across the different datasets – an accuracy of \(80\%\) on one dataset can mean something entirely different than \(60\%\) on another. Instead, we take the \(acc_{last5}\) per random seed, dataset, Uncertainty-Clipping variant, and alternative Uncertainty-measure and compute the relative percentage gain over the unclipped variant. Each displayed value is then the arithmetic mean of the relative gains over all random seeds and datasets. The Uncertainty-Clipping method with the highest improvement per Uncertainty-measure is highlighted in bold.
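For clarity, a minimal sketch of this aggregation, assuming matching arrays of \(acc_{last5}\) values for the clipped and unclipped runs (the example values are purely illustrative):

```python
import numpy as np

def mean_relative_gain(acc_clipped, acc_unclipped):
    """Relative percentage gain of acc_last5 with clipping over the unclipped
    run, computed per (seed, dataset) pair and then averaged arithmetically."""
    gains = (acc_clipped - acc_unclipped) / acc_unclipped * 100.0
    return gains.mean()

# hypothetical acc_last5 values for 2 seeds x 2 datasets
clipped = np.array([[0.82, 0.64], [0.80, 0.66]])
unclipped = np.array([[0.80, 0.62], [0.79, 0.65]])
print(mean_relative_gain(clipped, unclipped))   # mean gain in percent
```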

Fig. 5 Distribution of \(acc_{last5}\) including the Uncertainty-Clipping variants, with the average \(acc_{last5}\) values per dataset shown as colored lines, ordered by \(acc_{last5}\) after Uncertainty-Clipping. The arithmetic means of the runs per method are included as yellow diamonds in the middle of the plots. The vanilla softmax-based baselines Ent, MM, and LC are marked in blue, and the baselines Random Selection as well as the Passive classifier are marked in red

The first observation is that every proposed Uncertainty-Clipping method improves almost every alternative Uncertainty-measure. This is a strong indicator for using Uncertainty-Clipping in combination with uncertainty-based AL strategies. Which Uncertainty-Clipping method performs best depends on the combination of the Uncertainty-Clipping method, the alternative Uncertainty-measure, and, interestingly, the underlying Transformer language model as well. Vote Entropy (VE) is the only Uncertainty-measure that does not benefit from Uncertainty-Clipping, but as seen in the next section, this method does not perform well without it either. The overall best Uncertainty-Clipping method, indicated by the last column showing the average gain, is First Peak Clipping, as it slightly outperforms Top-k Clipping throughout all results. Therefore, the following evaluations are made using First Peak Clipping, but the results are almost identical for the other Uncertainty-Clipping methods.

The clipping threshold k was always set to 95% and determined empirically using hyperparameter tuning; other thresholds down to 90% also worked well.

5.2 Test Accuracies

The next evaluation goes into detail on how the alternative Uncertainty-measures compare to each other, and how the Uncertainty-Clipping depends on the underlying dataset.

Figure 5 displays the distributions of the average test accuracy of the last five AL iterations, \(acc_{last5}\), per method and per dataset. Each result is displayed using two boxplots: the light gray one displays the original values, the dark gray one those using Uncertainty-Clipping. The underlying distribution consists of the \(acc_{last5}\) values combined over all datasets. As each method was evaluated using 10 different starting points, we additionally display the arithmetic mean of the 10 repetitions per dataset as a colored line. The methods are ordered by the mean \(acc_{last5}\) value using Uncertainty-Clipping. As Uncertainty-Clipping does not apply to the baselines of a fully passive classifier (Pass) and random selection (Rand), only one boxplot is shown for these two.

General Remarks First, it is obvious that the difference between the individual methods depends on the used dataset. Averaged over all datasets (the yellow diamond in the middle of the boxplots indicates the arithmetic mean), however, the differences become marginal. Still, it can be safely stated that the two Softmax Ensemble techniques (KLD and VE) perform far worse than any other technique. This confirms the work of Gleave and Irving (2022), which investigated the usefulness of softmax ensemble methods for Transformer models and reached the same conclusion as ours: ensemble methods perform worse than even random sampling for Transformer models.

Uncertainty-Clipping The dark gray boxplots represent the results using the First Peak Uncertainty-Clipping method, compared to the unclipped, original results in the light gray boxplots. The reasoning behind the clipping is that a good Uncertainty-measure in an AL strategy selects mostly outliers, as these are indeed the most uncertain samples. But solely labeling outliers results in worse performance than random sampling, as can be seen in Fig. 5 when looking only at the light gray boxplots. Our paradoxical finding is therefore that uncertainty-based AL strategies actually benefit from a slightly less than perfect Uncertainty-measure.

Fig. 6 Heatmap of the Jaccard coefficients of the queried samples between each pair of strategies. High coefficients indicate highly similar strategies. On the right side, the displayed numbers indicate the difference to the original coefficients achieved by Uncertainty-Clipping

Comparing the light and dark gray boxplots per method, Uncertainty-Clipping often seems to mostly reduce the lower ends of the distribution. This is expected, as applying Uncertainty-Clipping can only ignore results and thereby prevent bad AL choices; as no new results are added, an improvement of already good choices is not to be expected. The effect of Uncertainty-Clipping becomes clearer if one compares the colored lines per dataset in each boxplot on the left side with those on the right side. These lines indicate the arithmetic mean per dataset. Especially for the datasets AG’s News and TREC-6, the clipping drastically improves the final test accuracy, as the right lines are often higher than the left ones. These datasets, which are more influenced by Uncertainty-Clipping, therefore appear to contain a higher percentage of outliers.

Fig. 7 Difference of class distribution of the queried samples compared to the train set for the two datasets TREC-6 and AG’s News for the original BERT model

Some combinations of alternative Uncertainty-measures and datasets do not benefit from Uncertainty-Clipping (e.g. the Rotten Tomatoes dataset and the Inhibited Softmax method for the BERT model). This is to be expected, as we are trying to ignore outlier points that do not really fit into the classification categories and are therefore quite ambiguous. They are easily influenced by even small changes to the setup, and in rare cases good data points may also be wrongly ignored. But given that NLP datasets are generally very large, there are always outlier points that should be ignored. We could not identify a single dataset that never benefits from Uncertainty-Clipping, indicating that every dataset contains outlier data points. In summary, except for the ensemble method Vote Entropy, we conclude from our extensive experiments that all methods generally benefit from Uncertainty-Clipping.

Baselines In addition to pure random selection and Uncertainty Least Confidence with the vanilla softmax function, Uncertainty Entropy and Uncertainty Max-Margin – both also using the vanilla softmax function – were included in our evaluation. The baseline strategy Uncertainty Max-Margin performs better on RoBERTa than on BERT, indicating that the softmax function is better calibrated to true probabilities for RoBERTa than for the original BERT model.

Best overall performing Uncertainty-measure method The results differ based on the used Transformer-encoder model. For BERT, no method was able to beat random sampling without our proposed Uncertainty-Clipping, whereas with it, the majority was able to justify the usage of AL. For RoBERTa, many methods were better than random sampling even without Uncertainty-Clipping, but benefited even further from using it. Beyond that, the question of whether to use the softmax activation function or an alternative Uncertainty-measure can be answered in favor of the status quo of continuing to use the softmax function. The purely softmax-based methods Uncertainty Least Confidence (LC) and Uncertainty Max-Margin (MM) still perform comparably well, with only marginal differences to most alternative methods. This confirms the conclusion of the deep investigation of the softmax function by Pearce et al (2021), stating that it might have a more sound basis than widely believed. Still, among the alternative methods, Monte-Carlo Dropout (MC) seems promising on both Transformer-encoder models and performs better than a pure softmax implementation, with Label Smoothing (LS) and Temperature Scaling (TeSc) close behind.

5.3 Queried Samples

We are also interested in the similarities and differences of the alternative Uncertainty-measures. Hence, we calculated the Jaccard coefficients between the sets of queried samples for each pair of compared AL strategies, displayed in Fig. 6. As our experiments were repeated 10 times each, we used the union of the sets of queried samples and the union of the results per dataset. Additionally, we included the samples wrongly classified (Wrong) by the passive classifier – potentially outliers – in the plots. The left side contains the Jaccard coefficients for the version without Uncertainty-Clipping, the right side those with the potential outlier samples being ignored. The right plots display the percentage difference to the unclipped version as numbers, whereas the color encoding still indicates the Jaccard coefficient.

For the unclipped version, the most similar strategies are unsurprisingly the three strategies using the pure softmax function: Uncertainty Entropy, Uncertainty Least Confidence, and Uncertainty Max-Margin. As the clipping has a highly negative impact on Uncertainty Entropy, it becomes highly dissimilar to almost every other strategy, with a drop of over 40% compared to Uncertainty Least Confidence.

Apart from this, Evidential Neural Networks, Label Smoothing, Monte-Carlo Dropout, Inhibited Softmax, and Temperature Scaling are about equally similar to each other, in the range of 30%, as was already indicated by the similar performance in the previous section. TrustScore is only similar in the range of 25%, and the Softmax Ensemble methods Vote Entropy and Kullback-Leibler Divergence are highly dissimilar to everything else. Also, almost all strategies are, as expected, very dissimilar to pure random selection, and even less similar to the set of wrongly classified samples.

Fig. 8 Runtime comparison of all methods, averaged over the datasets, in seconds per AL iteration

The negative numbers in the right plot show the difference to the original Jaccard coefficients. Uncertainty-Clipping makes all strategies behave more dissimilarly, as nearly all Jaccard coefficients go slightly down. This indicates that all strategies without Uncertainty-Clipping label the same set of outliers; after these are removed, they naturally have fewer similarities. We also included the samples wrongly classified (Wrong) by the fully trained passive classifiers in the analysis. Interestingly, a difference can be seen between the two compared Transformer models: for RoBERTa, most strategies sample few wrongly classified samples, and even fewer after Uncertainty-Clipping, whereas for the original BERT model this number even increases slightly with Uncertainty-Clipping.

In conclusion, equally well performing strategies also query similar samples, and Uncertainty-Clipping manages to remove a set of outliers. This set appears to be common to the different methods.

5.4 Class Distribution

To further analyze Uncertainty-Clipping, we selected the two datasets TREC-6 and AG’s News, which have more than two target classes, for a deeper analysis of the class distribution of the queried samples. We compared the class distributions to those of the training set; this is displayed in Fig. 7. In theory, Uncertainty-Clipping should especially improve those datasets in which many samples have ambiguous labels, which can harm the classification quality. The more classification target classes there are, the higher the chance of ambiguous labels, hence our focus on the two datasets TREC-6 and AG’s News. Additionally, we include the class distribution of the test dataset, on which the evaluation metrics were computed. Our expectation regarding the random strategy is that the class distribution of the queried samples matches that of the full available training dataset, which nearly holds true.

For TREC-6, most AL strategies favor samples from classes A and B, and sample less from classes D and F. Uncertainty-Clipping strongly enhances the selection of class B and further decreases sampling from class F. This indicates that Uncertainty-Clipping heavily correlates with specific classes, probably caused by many ambiguous labels of samples in class B. This could be an indicator for poor and ambiguous class boundaries between class B and the other classes in TREC-6.

For AG’s News, without Uncertainty-Clipping almost all strategies sample evenly across all classes; with Uncertainty-Clipping, they sample significantly less often from class B but more from classes C and D. Uncertainty Max-Margin stands out from the other methods, as without clipping it behaves similarly to all other methods with clipping. This explains why Uncertainty Max-Margin is one of the few methods that does not benefit from Uncertainty-Clipping for TREC-6 and AG’s News, as seen in Fig. 5, despite an otherwise good performance.

We infer from the data that Uncertainty-Clipping strongly shifts the distribution of the queried classes towards classes that are potentially more interesting for labeling, and that these classes are the same for all methods.

5.5 Runtime Comparison

AL is in practice a human-in-the-loop process. Annotators expect a responsive system that tells them immediately, without long waiting times, what to label next. Therefore, a fast runtime of the AL query selection is a crucial factor in making AL usable in the real world. In Fig. 8, we display the runtimes of our methods averaged over all datasets, in seconds per complete AL loop, measuring only the time of the AL computations, not the Transformer-encoder model fine-tuning time (Footnote 4). First, it becomes clear that the overhead of the two ensemble methods (KLD and VE) of training multiple Transformer-encoder models in parallel is much higher compared to the other methods. A waiting time of a couple of seconds is acceptable for most annotators, whereas a constant waiting time of over a full minute per AL iteration strains the patience of users. All other methods perform equally fast with a negligible user waiting time. Taking into consideration the poor performance of the ensemble methods, we refrain from recommending them for use in practice.

Weiss and Tonella (2022) compared Uncertainty Max-Margin, Uncertainty Entropy, Monte-Carlo Dropout, as well as DeepGini, a modified softmax version designed with a fast runtime in mind. The primary focus of their work is on test-input prioritizers for very large datasets. They conclude that Monte-Carlo Dropout performs better than the other strategies, but not so much that it justifies its usage over the vanilla softmax, and that more research in this area is necessary. We can partially confirm these results and argue that, considering only runtime, all methods besides the ensemble methods are equally good.

5.6 Limitations of Uncertainty-Clipping

Our proposed Uncertainty-Clipping variants work on the assumption that a large portion of harmful outlier data points exists in the dataset. These are data points with high ambiguity regarding which class they belong to, sometimes caused by labeling mistakes of the annotators, and sometimes genuinely ambiguous data points that could belong to multiple classes. Especially in real-world use cases, where classification is hard, this is a problem due to the complex nature of the real world. Also, very large datasets, as NLP datasets generally are, have a higher chance of containing more outliers simply because they contain more samples in general. Therefore, the chance of picking outliers for labeling in the first iterations of AL increases, and the more garbage labels due to outliers exist, the more cleanly labeled samples are needed to train a useful classifier. Uncertainty-Clipping is therefore not necessary – and potentially harmful – for small datasets that do not contain many outliers, and for datasets with such a high quality standard for the labels that one can be certain no outliers exist.

6 Conclusion

Noting the importance of NN Uncertainty-measures for AL and the potential shortcomings of simply using the vanilla softmax as such a measure, we experimentally compared eight alternative methods on seven datasets using both the original BERT Transformer-encoder model and the improved RoBERTa variant. After discovering that better Uncertainty-measures result in selecting mostly outliers for labeling, we proposed Uncertainty-Clipping, which improves all methods, including the vanilla softmax, which turned out to be the overall best performing technique.

In conclusion, we have not found any evidence for a generally bad performance of the vanilla softmax method. Therefore, we can safely recommend continuing to use it due to the nearly non-existent application overhead in an AL setting. Yet, the compared methods such as Monte-Carlo Dropout, Label Smoothing, and Temperature Scaling are not far behind in our ranking of the best methods. Our proposed Uncertainty-Clipping improves both the vanilla softmax and the alternative methods, and should be used in practice; we found that either a simple Top-k Clipping or the First Peak variant works well. Future research should compare the effectiveness of Uncertainty-Clipping in combination with more advanced AL strategies that also make use of the vector space. Coming back to the concrete real-world use case of categorizing legal documents, our proposed Uncertainty-Clipping enables practitioners to use AL to annotate large datasets in the NLP domain with minimal effort using powerful Transformer-based language models, without having to worry about outliers spoiling the resulting labeled dataset.