1 Introduction

The goal in standard supervised learning, such as binary or multi-class classification, is to learn models with high predictive accuracy from labelled training data (Hastie et al. 2005; Vapnik 1999). However, labelled data does not come for free; on the contrary, labelling can be expensive and time-consuming. The ambition of active learning, therefore, is to exploit labelled data in the most effective way. More specifically, the idea is to let the learning algorithm itself decide which examples it considers to be most informative. Compared to random sampling, the hope is to achieve better performance with the same amount of training data, or to reach the same performance with less data (Fu et al. 2013; Settles 2009).

The selection of training examples is often done in an iterative manner, i.e., the active learner alternates between re-training and selecting new examples. In each iteration, the usefulness of a candidate example is estimated in terms of a utility score, and the one with the highest score is queried. In this regard, the notion of utility typically refers to uncertainty reduction: To what extent will the knowledge about the label of a specific instance help to reduce the learner’s uncertainty about the sought model? In uncertainty sampling (Settles 2009), which is among the most popular approaches, utility is quantified in terms of predictive uncertainty, i.e., the active learner selects those instances for which its current prediction is maximally uncertain. The predictions as well as the measures used to quantify the degree of uncertainty, such as entropy, are almost exclusively of a probabilistic nature. Such approaches have indeed proved successful in many applications.

Yet, as pointed out by Sharma and Bilgic (2017), existing approaches can be criticized for not informing about the reasons for why an instance is considered uncertain, although this might be relevant for judging the potential usefulness of an example. They propose an evidence-based approach to active learning, in which conflicting-evidence uncertainty is distinguished from insufficient-evidence uncertainty. A similar distinction between two types of uncertainty, called epistemic and aleatoric uncertainty, has been made in the recent machine learning literature (Hüllermeier and Waegeman 2021; Kendall and Gal 2017; Senge et al. 2014). Roughly speaking, aleatoric uncertainty is due to inherent randomness, whereas epistemic uncertainty captures the lack of knowledge of the learner. Thus, the latter corresponds to the reducible and the former to the irreducible part of the total uncertainty in a prediction. Last but not least, measures of uncertainty are also discussed in connection with generalizations of standard probability theory, most notably imprecise probability (De Campos et al. 1994; Zaffalon 2002). Here, incomplete information is captured in the form of credal sets, that is, (convex) sets of probability distributions; correspondingly, standard uncertainty measures for single probability distributions (such as entropy) are generalized toward credal uncertainty measures.

This paper is an extension of (Nguyen et al. 2019), in which the authors conjecture that, in uncertainty sampling, the usefulness of an instance is better reflected by its epistemic than by its aleatoric uncertainty, and provide first evidence in favor of this conjecture. The goal of the current paper is to elaborate more broadly on the usefulness of different measures for uncertainty sampling, and to compare their performance in active learning. To this end, we instantiate uncertainty sampling with different measures, analyze the properties of the sampling strategies thus obtained, and compare them in an experimental study.

The rest of this paper is organized as follows. In the next section, we first recall the general framework of uncertainty sampling and provide a brief survey of related work on active learning. We present different approaches for measuring the learner’s uncertainty in a query instance and a comparison of the approaches in Sects. 3 and 4, respectively. Experimental evaluations for local learning (Parzen window classifier), decision trees and logistic regression are presented in Sect. 5, prior to concluding the paper in Sect. 6. Technical details for instantiations of aleatoric and epistemic uncertainty are deferred to the appendices.

2 Uncertainty sampling

In this section, we briefly recall the basic setting of uncertainty sampling. As usual in active learning, we assume to be given a labelled set of training data \({\mathbf {D}}\) and a pool of unlabeled instances \({\mathbf {U}}\) that can be queried by the learner:

$$\begin{aligned} {\mathbf {D}}=\big \{ (\varvec{x}_1, y_1) , \ldots , (\varvec{x}_N, y_N) \big \} , \quad {\mathbf {U}} = \big \{ \varvec{x}_1, \ldots , \varvec{x}_J \big \} \, . \end{aligned}$$

Instances are represented as feature vectors \(\varvec{x}_i = \left( x_i^1,\ldots , x_i^d \right) \in {\mathcal {X}}= {\mathbb {R}}^d\). In this paper, we only consider the case of binary classification, where labels \(y_i\) are taken from \({\mathcal {Y}}= \{ 0, 1 \}\), leaving the more general case of multi-class classification for future work. We denote by \({\mathcal {H}}\subset {\mathcal {Y}}^{\mathcal {X}}\) the underlying hypothesis space, i.e., the class of candidate models \(h:\, {\mathcal {X}}\longrightarrow {\mathcal {Y}}\) the learner can choose from. Often, hypotheses are parametrized by a parameter vector \(\theta \in \Theta\); in this case, we equate a hypothesis \(h= h_\theta \in {\mathcal {H}}\) with the parameter \(\theta\), and the model space \({\mathcal {H}}\) with the parameter space \(\Theta\).

In uncertainty sampling, instances are queried in a greedy fashion. Given the current model \(\theta\) that has been trained on \({\mathbf {D}}\), each instance \(\varvec{x}_j\) in the current pool \({\mathbf {U}}\) is assigned a utility score \(s(\theta ,\varvec{x}_j)\), and the next instance to be queried is the one with the highest score (Lewis and Gale 1994; Settles 2009; Settles and Craven 2008; Sharma and Bilgic 2017). The chosen instance is labelled (by an oracle or expert) and added to the training data \({\mathbf {D}}\), on which the model is then re-trained. The active learning process for a given budget B (i.e., the number of unlabelled instances to be queried) is summarized in Algorithm 1.

Algorithm 1  Uncertainty sampling for a given budget B
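In code, this loop is a simple greedy procedure. The following Python sketch is purely illustrative; the names fit, utility, and query_label are hypothetical placeholders for the training routine, the utility score \(s(\theta , \varvec{x})\), and the labelling oracle, none of which are prescribed by Algorithm 1 itself.

```python
def uncertainty_sampling(D, U, fit, utility, query_label, budget):
    """Sketch of Algorithm 1 (greedy pool-based uncertainty sampling).

    D: list of labelled pairs (x, y); U: list of unlabelled instances;
    fit(D) trains and returns a model; utility(theta, x) is the score
    s(theta, x); query_label(x) is the labelling oracle.
    """
    theta = fit(D)
    for _ in range(budget):
        # score all pool instances and query the highest-scoring one
        x_star = max(U, key=lambda x: utility(theta, x))
        U.remove(x_star)
        D.append((x_star, query_label(x_star)))
        theta = fit(D)  # re-train on the enlarged training data
    return theta
```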

Assuming a probabilistic model producing predictions in the form of probability distributions \(p_\theta ( \cdot \, \vert \, \varvec{x})\) on \({\mathcal {Y}}\), the utility score is typically defined in terms of a measure of uncertainty. Thus, instances on which the current model is highly uncertain are supposed to be maximally informative (Settles 2009; Settles and Craven 2008; Sharma and Bilgic 2017). Popular examples of such measures include

  • the entropy:

    $$\begin{aligned} s(\theta ,\varvec{x}) = - \sum _{y\in {\mathcal {Y}}} p_{\theta }(y\, \vert \, \varvec{x})\log p_{\theta }(y\, \vert \, \varvec{x}) \, , \end{aligned}$$
    (1)
  • the least confidence:

    $$\begin{aligned} s(\theta ,\varvec{x}) = 1 - \max _{y\in {\mathcal {Y}}} p_{\theta }(y\, \vert \, \varvec{x}) \, , \end{aligned}$$
    (2)
  • the smallest margin:

    $$\begin{aligned} s(\theta ,\varvec{x}) = p_{\theta }(y_m \, \vert \, \varvec{x}) - p_{\theta }(y_n \, \vert \, \varvec{x}) \, , \end{aligned}$$
    (3)

    where \(y_m = \arg \max _{y\in {\mathcal {Y}}} p_{\theta }(y\, \vert \, \varvec{x})\) and \(y_n = \arg \max _{y\in {\mathcal {Y}}\setminus \{ y_m \}} p_{\theta }(y\, \vert \, \varvec{x})\).

While the first two measures ought to be maximized, the last one has to be minimized. In the case of binary classification, i.e., \({\mathcal {Y}}= \{ 0, 1 \}\), all these measures rank unlabelled instances in the same order and look for instances with a small difference between \(p_{\theta }(0 \, \vert \, \varvec{x})\) and \(p_{\theta }(1 \, \vert \, \varvec{x})\).
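To make this equivalence concrete, the following sketch (with illustrative helper names) computes the three scores for a binary predictive distribution; as \(p_{\theta }(1 \, \vert \, \varvec{x})\) moves away from 1/2, entropy and least confidence decrease while the margin increases, so all three criteria induce the same ranking.

```python
import math

def entropy(p):           # Eq. (1), to be maximized
    return -sum(q * math.log(q) for q in p if q > 0)

def least_confidence(p):  # Eq. (2), to be maximized
    return 1 - max(p)

def smallest_margin(p):   # Eq. (3), to be minimized
    top, second = sorted(p, reverse=True)[:2]
    return top - second

# in the binary case, all three criteria order instances identically:
for p1 in (0.5, 0.6, 0.8, 0.99):
    p = [1 - p1, p1]
    print(p, entropy(p), least_confidence(p), smallest_margin(p))
```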

3 Measures of uncertainty

In this section, we present different frameworks for measuring the learner’s uncertainty in a query instance: evidence-based uncertainty (EBU), credal uncertainty (CU), and an approach focusing on a distinction between epistemic and aleatoric uncertainty (EAU). While the first one has been specifically developed for the purpose of active learning, the other two are more general approaches to uncertainty quantification in machine learning. Yet, their potential usefulness for active learning has been pointed out as well (Antonucci et al. 2012; Nguyen et al. 2019).

3.1 Evidence-based uncertainty (EBU)

In their evidence-based uncertainty sampling approach, Sharma and Bilgic (2013, 2017) propose to differentiate between uncertainty due to conflicting evidence and insufficient evidence. The corresponding measures of conflicting-evidence uncertainty and insufficient-evidence uncertainty are mainly developed for the Naïve Bayes (NB) classifier as a learning algorithm. In the spirit of this classifier, evidence-based uncertainty sampling first looks at the influence of individual features \(x^m\) in the feature representation \(\varvec{x} = (x^1, \ldots , x^d)\) of instances. More specifically, given the current model \(\theta\), denote by \(p_\theta (x^m \, \vert \, 0)\) and \(p_\theta (x^m \, \vert \, 1)\) the class-conditional probabilities on the values of the \(m^{th}\) feature. For a given instance \(\varvec{x}\), the authors partition the set of features into those that provide evidence for the positive and for the negative class, respectively:

$$\begin{aligned} P_{\theta }(\varvec{x})&= \bigg \{ x^m \, \bigg \vert \, \frac{p_{\theta }(x^m \, \vert \, 1)}{p_{\theta }(x^m \, \vert \, 0)} > 1 \bigg \} \, , \end{aligned}$$
(4)
$$\begin{aligned} N_{\theta }(\varvec{x})&= \bigg \{ x^m \, \bigg \vert \, \frac{p_{\theta }(x^m \, \vert \, 0)}{p_{\theta }(x^m \, \vert \, 1)} > 1 \bigg \} \, . \end{aligned}$$
(5)

Then, the total evidence for the positive and the negative class is determined as follows:

$$\begin{aligned} E_{1}(\varvec{x})&= \prod _{x^m \in P_{\theta }(\varvec{x})} \frac{p_{\theta }(x^m \, \vert \, 1)}{p_{\theta }(x^m \, \vert \, 0)} \, , \end{aligned}$$
(6)
$$\begin{aligned} E_{0}(\varvec{x})&= \prod _{x^m \in N_{\theta }(\varvec{x})} \frac{p_{\theta }(x^m \, \vert \, 0)}{p_{\theta }(x^m \, \vert \, 1)} \, . \end{aligned}$$
(7)

The authors consider a situation as conflicting evidence if both \(E_{0}(\varvec{x})\) and \(E_{1}(\varvec{x})\) are high, because in such a situation, there is strong evidence in favor of the positive as well as strong evidence in favor of the negative class. Likewise, a situation in which both evidences are low is considered as insufficient evidence. Measuring these conditions in terms of the product \(E_{1}(\varvec{x}) \times E_{0}(\varvec{x})\), the conflicting evidence-based approach simply queries the instance with the highest conflicting evidence, while the insufficient evidence-based approach looks for the one with the highest insufficient evidence:

$$\begin{aligned} \varvec{x}^*_{conf}&= \arg \max _{\varvec{x} \in {\mathbf {S}}} E_{1}(\varvec{x}) \times E_{0}(\varvec{x}) \, , \end{aligned}$$
(8)
$$\begin{aligned} \varvec{x}^*_{insu}&= \arg \min _{\varvec{x} \in {\mathbf {S}}} E_{1}(\varvec{x}) \times E_{0}(\varvec{x}) \, . \end{aligned}$$
(9)

Note that the selection is restricted to the set \({\mathbf {S}}\) of instances \(\varvec{x}\) in the pool \({\mathbf {U}}\) having the highest scores \(s(\theta , \varvec{x})\) according to standard uncertainty sampling; the size of this set, \(t = |{\mathbf {S}}|\), is a parameter of the method (and hence a hyper-parameter for the active learning algorithm). The restriction to the most uncertain cases puts evidence-based uncertainty sampling close to standard uncertainty sampling. Instead of using conflicting-evidence and insufficient-evidence uncertainties as selection criteria on their own, they are merely used for prioritizing cases that appear to be uncertain in the traditional sense.
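For illustration, a minimal sketch of the evidence computation (4)–(9) could look as follows. The function cp, returning the estimated class-conditional probability \(p_\theta (x^m = v \, \vert \, y)\) of an observed feature value, is a hypothetical placeholder and not part of the original formulation; S denotes the pre-selected top-t uncertain set.

```python
def evidences(x, cp):
    """Total evidences E1(x) and E0(x) as in Eqs. (6)-(7); cp(m, v, y)
    returns the estimated class-conditional probability p(x^m = v | y)."""
    e1 = e0 = 1.0
    for m, v in enumerate(x):
        ratio = cp(m, v, 1) / cp(m, v, 0)
        if ratio > 1:    # feature supports the positive class, Eq. (4)
            e1 *= ratio
        elif ratio < 1:  # feature supports the negative class, Eq. (5)
            e0 *= 1.0 / ratio
    return e1, e0

def conflicting_query(S, cp):   # Eq. (8): highest conflicting evidence
    return max(S, key=lambda x: evidences(x, cp)[0] * evidences(x, cp)[1])

def insufficient_query(S, cp):  # Eq. (9): highest insufficient evidence
    return min(S, key=lambda x: evidences(x, cp)[0] * evidences(x, cp)[1])
```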

3.1.1 A note on evidence-based uncertainty

Interestingly, to motivate their approach, Sharma and Bilgic (2017) note that “regardless of whether we want to maximize or minimize \(E_{1}(\varvec{x})\times E_{0}(\varvec{x})\), we want to guarantee that the underlying model is uncertain about the chosen instance”, thereby suggesting that the evidence-based uncertainties alone do not necessarily inform about this uncertainty. Indeed, it is true that these uncertainties are not easy to interpret (see also our discussion in Sect. 4.1), and that their relationship to standard uncertainty measures is not fully obvious.

In particular, note that standard measures such as entropy also comprise the influence of the prior class probabilities, which is completely neglected by the evidence-based uncertainties (which only look at the likelihood). This is especially relevant in the case of imbalanced class distributions. In such cases, evidence-based uncertainty may strongly deviate from standard uncertainty, i.e., the entropy of the posterior distribution. For instance, \(E_0(\varvec{x})\) and \(E_1(\varvec{x})\) could both be very large, and \(p_\theta (\varvec{x} \, \vert \, 0) \approx p_\theta (\varvec{x} \, \vert \, 1)\), although \(p_\theta (0 \, \vert \, \varvec{x})\) is very different from \(p_\theta (1 \, \vert \, \varvec{x})\) due to unequal prior odds, and hence the entropy is small. Likewise, the entropy of the posterior can be large although both evidence-based uncertainties are small.

3.1.2 A note on uncertainty sampling for Naïve Bayes

The evidence-based approach to uncertainty sampling has been introduced with a focus on Naïve Bayes as a base learner. In this regard, we would like to note that uncertainty sampling for this learner might be considered critical in general.

It is clear that active learning may always incorporate a bias, simply because the data is no longer produced by sampling independently according to the true underlying distribution. Thus, the data is no longer completely representative. While this may affect any learning algorithm, the effect appears to be especially strong for NB, so that uncertainty sampling for NB appears to be questionable in general. In fact, a sample bias has a very direct influence on the probabilities estimated by NB. In particular, the estimated class priors are strongly biased toward the conditional class probabilities of those instances with a high uncertainty, because these are sampled more often. This bias may in turn affect the classifier as a whole, and lead to suboptimal predictions.

As an illustration of the problem, let us consider a small example with only two binary attributes \(x^1\) and \(x^2\). This example may appear unrealistic, because the instance space is finite and actually quite small. Please note, however, that even in practice NB is typically applied to discrete attributes with finite domains (possibly after a discretization of numerical attributes in a pre-processing step).

Suppose the class priors to be given by \(p(y=0) = 0.3\) and \(p(y=1) = 0.7\), and the class-conditional probabilities as follows:

$$\begin{aligned} p(x^1 =1 \, \vert \, y=0)&= 0.4 \,, \\ p(x^1=1 \, \vert \, y=1)&= 0.2 \,, \\ p(x^2 =1 \, \vert \, y=0)&= 0.8 \,, \\ p(x^2=1 \, \vert \, y=1)&= 0.4 \, . \end{aligned}$$

From these, one derives the following posterior probabilities:

$$\begin{aligned} p(y=0 \, \vert \, x^1=0, x^2=0)&\approx 0.10 \,,&p(y=1 \, \vert \, x^1=0, x^2=0)&\approx 0.90 \, , \\ p(y=0 \, \vert \, x^1=0, x^2=1)&\approx 0.40 \,,&p(y=1 \, \vert \, x^1=0, x^2=1)&\approx 0.60 \, ,\\ p(y=0 \, \vert \, x^1=1, x^2=0)&\approx 0.22 \,,&p(y=1 \, \vert \, x^1=1, x^2=0)&\approx 0.78 \, ,\\ p(y=0 \, \vert \, x^1=1, x^2=1)&\approx 0.63 \,,&p(y=1 \, \vert \, x^1=1, x^2=1)&\approx 0.37 \, .\\ \end{aligned}$$
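These numbers are easily verified; the following snippet recomputes the Naïve Bayes posteriors from the priors and class-conditional probabilities given above, agreeing with the table up to rounding.

```python
prior = {0: 0.3, 1: 0.7}
p_x1 = {0: 0.4, 1: 0.2}  # p(x^1 = 1 | y)
p_x2 = {0: 0.8, 1: 0.4}  # p(x^2 = 1 | y)

def posterior(x1, x2):
    joint = {y: prior[y]
                * (p_x1[y] if x1 else 1 - p_x1[y])
                * (p_x2[y] if x2 else 1 - p_x2[y])
             for y in (0, 1)}
    z = joint[0] + joint[1]
    return joint[0] / z, joint[1] / z

for x1 in (0, 1):
    for x2 in (0, 1):
        p0, p1 = posterior(x1, x2)
        print(f"p(y=0 | {x1},{x2}) = {p0:.2f},  p(y=1 | {x1},{x2}) = {p1:.2f}")
```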

Now, consider an active learner that can sample from a large (in principle infinite) pool of unlabeled data points (i.e., multiple copies of each of the four instances). Since the second instance \((x^1, x^2) =(0,1)\) has the highest entropy, standard uncertainty sampling will sooner or later focus on this instance and sample it over and over again. This of course has an influence on the estimation of priors and conditional probabilities by NB. In particular, the estimated class priors \(\hat{p}(y=0)\) and \(\hat{p}(y=1)\) will converge to the conditional posteriors, i.e., the posteriors of y given \((x^1, x^2) =(0,1)\). Consequently, we will produce a bias in the estimates, and will obtain

$$\begin{aligned} \hat{p}(y=0 \, \vert \, x^1=0, x^2=0)&\approx 0.19 \,,&\hat{p}(y=1 \, \vert \, x^1=0, x^2=0)&\approx 0.81 \,, \\ \hat{p}(y=0 \, \vert \, x^1=0, x^2=1)&\approx 0.38 \,,&\hat{p}(y=1 \, \vert \, x^1=0, x^2=1)&\approx 0.62 \,, \\ \hat{p}(y=0 \, \vert \, x^1=1, x^2=0)&\approx 0.24 \,,&\hat{p}(y=1 \, \vert \, x^1=1, x^2=0)&\approx 0.76 \,, \\ \hat{p}(y=0 \, \vert \, x^1=1, x^2=1)&\approx 0.45 \,,&\hat{p}(y=1 \, \vert \, x^1=1, x^2=1)&\approx 0.56 \,.\\ \end{aligned}$$

As one can see, this will even have an effect on the Bayes-optimal predictor: For \((x^1, x^2) =(1,1)\), the prediction will be \(\hat{y} =1\) instead of the actually optimal prediction \(\hat{y} =0\). Similar effects can be found for the evidence-based approach. For example, when applying the insufficient evidence approach, it can happen that the active learner will completely focus on the third instance, which has the highest insufficient evidence, and then produce the following estimates:

$$\begin{aligned} \hat{p}(y=0 \, \vert \, x^1=0, x^2=0)&\approx 0.14 \,,&\hat{p}(y=1 \, \vert \, x^1=0, x^2=0)&\approx 0.86 \,,\\ \hat{p}(y=0 \, \vert \, x^1=0, x^2=1)&\approx 0.36 \,,&\hat{p}(y=1 \, \vert \, x^1=0, x^2=1)&\approx 0.64 \,,\\ \hat{p}(y=0 \, \vert \, x^1=1, x^2=0)&\approx 0.23 \,,&\hat{p}(y=1 \, \vert \, x^1=1, x^2=0)&\approx 0.78 \,,\\ \hat{p}(y=0 \, \vert \, x^1=1, x^2=1)&\approx 0.49 \,,&\hat{p}(y=1 \, \vert \, x^1=1, x^2=1)&\approx 0.51 \,.\\ \end{aligned}$$

So again, the prediction for \((x^1, x^2) =(1,1)\) will be \(\hat{y} =1\) instead of \(\hat{y} =0\).

3.1.3 Extension to other learners

As explained above, the approach by Sharma and Bilgic (2017) is specifically tailored to Naïve Bayes as a learning algorithm. Yet, the authors also propose variants of their measures for logistic regression and support vector machines. For example, if logistic regression is fitted with the discriminant function \(h_{\theta }(\varvec{x})= \theta _0 + \sum _{i=1}^d \theta _i \cdot x^i\) (whose zero set defines the decision boundary), the evidences for the positive and the negative class are defined, respectively, as follows (Sharma and Bilgic 2017):

$$\begin{aligned} E_{1}(\varvec{x}) = \sum _{x^m \in P_{\theta }(\varvec{x})} \theta _m \cdot x^m , \quad E_{0}(\varvec{x}) = - \sum _{x^m \in N_{\theta }(\varvec{x})} \theta _m \cdot x^m \, , \end{aligned}$$
(10)

where

$$\begin{aligned} P_{\theta }(\varvec{x}) = \big \{ x^m \, \big \vert \, \theta _m \cdot x^m > 0 \big \} , \quad N_{\theta }(\varvec{x}) = \big \{ x^m \, \big \vert \, \theta _m \cdot x^m < 0 \big \} \, . \end{aligned}$$
(11)

Obviously, evidence-based uncertainty measures can be derived in a quite natural way for models in which the features contribute independently to the prediction. However, the approach becomes much less straightforward in the case where features may interact with each other. In any case, new measures need to be derived for every model class separately. The approaches to be discussed next are more generic (and hence more principled) in the sense of being independent of the model class. That is, concrete measures of uncertainty can be derived for any model class in a generic way.
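A sketch of the logistic-regression variant (10)–(11) illustrates how little machinery is involved; in the code below, theta is assumed to collect the coefficients \(\theta _1, \ldots , \theta _d\) only, since the bias \(\theta _0\) plays no role in the measure.

```python
def lr_evidences(x, theta):
    """Evidences for logistic regression, Eqs. (10)-(11); x = (x^1, ..., x^d),
    theta = (theta_1, ..., theta_d). The bias theta_0 is ignored."""
    contrib = [t * v for t, v in zip(theta, x)]
    e1 = sum(c for c in contrib if c > 0)   # evidence for the positive class
    e0 = -sum(c for c in contrib if c < 0)  # evidence for the negative class
    return e1, e0
```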

3.2 Credal uncertainty (CU)

Consider an instance space \({\mathcal {X}}\), output space \({\mathcal {Y}}= \{ 0, 1 \}\), and a hypothesis space \({\mathcal {H}}\) consisting of probabilistic classifiers \(h: {\mathcal {X}}\longrightarrow [0,1]\). Assuming that each hypothesis \(h = h_\theta\) is identified by a (unique) parameter vector \(\theta \in \Theta\), we can equate \({\mathcal {H}}\) with the parameter space \(\Theta\). We denote by \(p_{\theta }(1 \, \vert \, \varvec{x}) = h_\theta (\varvec{x})\) and \(p_{\theta }(0 \, \vert \, \varvec{x}) = 1- h_\theta (\varvec{x})\) the (predicted) probability that instance \(\varvec{x} \in {\mathcal {X}}\) belongs to the positive and negative class, respectively.

Credal uncertainty sampling (Antonucci et al. 2012) seeks to differentiate between the reducible and irreducible part of the uncertainty in a prediction. Denote by \(C \subseteq \Theta\) a credal set of models, i.e., a set of plausible candidate models. We say that a class \(y\) dominates another class \(y'\) if \(y\) is more probable than \(y'\) for each distribution in the credal set, that is

$$\begin{aligned} \gamma (y,y', \varvec{x}) = \inf_{\theta \in C} \frac{p_{\theta}(y\, \vert \, \varvec{x})}{p_{\theta}(y'\, \vert\, \varvec{x})} > 1 \, . \end{aligned}$$
(12)

The credal uncertainty sampling approach simply looks for the instance \(\varvec{x}\) with the highest uncertainty, i.e., the least evidence for the dominance of one of the classes. In the case of binary classification with \({\mathcal {Y}}= \{ 0, 1 \}\), this is expressed by the score

$$\begin{aligned} s(\varvec{x}) = - \max \big (\gamma (1, 0, \varvec{x} ), \gamma (0, 1, \varvec{x} ) \big ) \, . \end{aligned}$$
(13)

Practically, the computations are based on the interval-valued probabilities

$$\begin{aligned} \big [ {\underline{p}}(y\, \vert \, \varvec{x}), {\overline{p}}(y\, \vert \, \varvec{x}) \big ] \,, \end{aligned}$$

assigned to each class \(y\in \{0,1\}\), where

$$\begin{aligned} {\underline{p}}(y\, \vert \, \varvec{x}) = \inf _{\theta \in C} p_{\theta }(y\, \vert \, \varvec{x})\, , \quad {\overline{p}}(y\, \vert \, \varvec{x}) = \sup _{\theta \in C} p_{\theta }(y\, \vert \, \varvec{x}) \, . \end{aligned}$$
(14)

Such interval-valued probabilities can be produced within the framework of the Naïve credal classifier (Antonucci et al. 2012; Antonucci and Cuzzolin 2010; De Campos et al. 1994; Zaffalon 2002). In the case of binary classification, where \(p_{\theta }(0 \, \vert \, \varvec{x}) = 1 -p_{\theta }(1 \, \vert \, \varvec{x})\), the score \(\gamma (1,0, \varvec{x})\) can be rewritten as follows:

$$\begin{aligned} \gamma (1,0, \varvec{x}) = \inf _{\theta \in C} \frac{p_{\theta }(1 \, \vert \, \varvec{x})}{p_{\theta }(0 \, \vert \, \varvec{x})} = \inf _{\theta \in C} \frac{p_{\theta }(1\, \vert \, \varvec{x})}{1- p_{\theta }(1 \, \vert \, \varvec{x})} = \frac{{\underline{p}}(1\, \vert \, \varvec{x})}{1- {\underline{p}}(1\, \vert \, \varvec{x})} \,. \end{aligned}$$
(15)

Likewise,

$$\begin{aligned} \gamma (0,1, \varvec{x}) = \inf _{\theta \in C} \frac{p_{\theta }(0 \, \vert \, \varvec{x})}{p_{\theta }(1 \, \vert \, \varvec{x})} = \inf _{\theta \in C} \frac{1- p_{\theta }(1 \, \vert \, \varvec{x})}{p_{\theta }(1\, \vert \, \varvec{x})} = \frac{1- {\overline{p}}(1\, \vert \, \varvec{x})}{{\overline{p}}(1\, \vert \, \varvec{x})} \, . \end{aligned}$$
(16)

Finally, the uncertainty score (13) can simply be expressed as follows:

$$\begin{aligned} s(\varvec{x}) = - \max \bigg ( \frac{{\underline{p}}(1\, \vert \, \varvec{x})}{1- {\underline{p}}(1\, \vert \, \varvec{x})}, \frac{1- {\overline{p}}(1\, \vert \, \varvec{x})}{{\overline{p}}(1\, \vert \, \varvec{x})}\bigg ) \,. \end{aligned}$$
(17)

3.3 Epistemic and aleatoric uncertainty (EAU)

A distinction between the epistemic and aleatoric uncertainty (Hora 1996) in a prediction for an instance \(\varvec{x}\) has been motivated by Senge et al. (2014). Their approach is based on the use of relative likelihoods, historically proposed by Birnbaum (1962) and then justified in other settings such as possibility theory (Walley and Moral 1999).

Given a set of training data \({\mathbf {D}}= \{ (\varvec{x}_i , y_i) \}_{i=1}^N \subset {\mathcal {X}}\times {\mathcal {Y}}\), the normalized likelihood of a model \(h_\theta\) is defined as

$$\begin{aligned} \pi _{\Theta }(\theta ) = \frac{L(\theta )}{L(\theta ^{ml})} = \frac{L(\theta )}{\max _{\theta ' \in \Theta } L(\theta ')} , \end{aligned}$$
(18)

where \(L(\theta ) = \prod _{i=1}^N p_{\theta }(y_i \, \vert \, \varvec{x}_i)\) is the likelihood of \(\theta\), and \(\theta ^{ml} \in \Theta\) the maximum likelihood estimate on the training data. For a given instance \(\varvec{x}\), the degrees of support (plausibility) of the two classes are defined as follows:

$$\begin{aligned} \pi (1\, \vert \, \varvec{x})&= \sup _{\theta \in \Theta } \min \big [\pi _{\Theta }(\theta ), p_{\theta }(1 \, \vert \, \varvec{x}) - p_{\theta }(0 \, \vert \, \varvec{x}) \big ], \\ \pi (0 \, \vert \, \varvec{x})&= \sup _{\theta \in \Theta } \min \big [\pi _{\Theta }(\theta ), p_{\theta }(0 \, \vert \, \varvec{x}) - p_{\theta }(1 \, \vert \, \varvec{x}) \big ]. \end{aligned}$$

So, \(\pi (1 \, \vert \, \varvec{x})\) is high if and only if a highly plausible model supports the positive class much more strongly (in terms of the assigned probability mass) than the negative class (and \(\pi (0 \, \vert \, \varvec{x})\) can be interpreted analogously). Note that, with \(f(a)= 2a-1\), we can also write

$$\begin{aligned} \pi (1\, \vert \, \varvec{x})&= \sup _{\theta \in \Theta } \min \big [\pi _{\Theta }(\theta ), f(h_\theta (\varvec{x})) \big ], \end{aligned}$$
(19)
$$\begin{aligned} \pi (0 \, \vert \, \varvec{x})&= \sup _{\theta \in \Theta } \min \big [\pi _{\Theta }(\theta ), f(1- h_\theta (\varvec{x})) \big ]. \end{aligned}$$
(20)

Given the above degrees of support, the degrees of epistemic uncertainty \(u_e\) and aleatoric uncertainty \(u_a\) are defined as follows:

$$\begin{aligned} u_e(\varvec{x})&= \min \big [ \pi (1 \, \vert \, \varvec{x}), \pi (0 \, \vert \, \varvec{x}) \big ] \, , \end{aligned}$$
(21)
$$\begin{aligned} u_a(\varvec{x})&= 1 - \max \big [ \pi (1 \, \vert \, \varvec{x}), \pi (0 \, \vert \, \varvec{x}) \big ] \, . \end{aligned}$$
(22)

Thus, epistemic uncertainty refers to the case where both the positive and the negative class appear to be plausible, while the degree of aleatoric uncertainty (22) is the degree to which none of the classes is supported. Roughly speaking, aleatoric uncertainty is due to influences on the data-generating process that are inherently random, whereas epistemic uncertainty is caused by a lack of knowledge. Or, stated differently, \(u_e\) and \(u_a\) measure the reducible and the irreducible part of the total uncertainty, respectively.

It is thus tempting to assume that epistemic uncertainty is more relevant for active learning: While it makes sense to query additional class labels in regions where uncertainty can be reduced, doing so in regions of high aleatoric uncertainty appears to be less reasonable. This leads us to suggest the principle of epistemic uncertainty sampling, which prescribes the selection

$$\begin{aligned} \varvec{x}^* &= \arg \max _{\varvec{x} \in {\mathbf {U}}} u_e(\varvec{x}) \, . \end{aligned}$$
(23)

For comparison, we will also consider an analogous selection rule based on the aleatoric uncertainty, i.e.,

$$\begin{aligned} \varvec{x}^* &= \arg \max _{\varvec{x} \in {\mathbf {U}}} u_a(\varvec{x}) \, . \end{aligned}$$
(24)

As already said, this approach is completely generic and can in principle be instantiated with any hypothesis space \({\mathcal {H}}\). The uncertainty measures (21)–(22) can be derived very easily from the support degrees (19)–(20). The computation of the latter may become difficult, however, as it requires the solution of an optimization problem, the properties of which depend on the choice of \({\mathcal {H}}\). We are going to present practical methods to determine (19)–(20) for the cases of a simple Parzen window classifier and logistic regression in Sects. A.1 and A.2, respectively.
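As a concrete illustration, consider the special case of a constant hypothesis \(h_\theta \equiv \theta\) with \(\Theta = [0,1]\), which will reappear for local learning in Sect. 5.1.1. Here, (19)–(22) can be approximated by brute force over a discretization of \(\Theta\); the following sketch is meant for illustration only and is not the (more efficient) procedure of the appendices.

```python
import numpy as np

def eau(p, n, grid=10001):
    """Epistemic/aleatoric uncertainty, Eqs. (19)-(22), for a region with
    p positive and n negative examples and constant hypothesis h_theta = theta.
    Brute-force grid approximation; illustrative only."""
    theta = np.linspace(1e-6, 1 - 1e-6, grid)
    log_lik = p * np.log(theta) + n * np.log(1 - theta)
    pi_theta = np.exp(log_lik - log_lik.max())        # normalized likelihood, Eq. (18)
    pi_1 = np.minimum(pi_theta, 2 * theta - 1).max()  # support of class 1, Eq. (19)
    pi_0 = np.minimum(pi_theta, 1 - 2 * theta).max()  # support of class 0, Eq. (20)
    u_e = min(pi_1, pi_0)                             # epistemic uncertainty, Eq. (21)
    u_a = 1 - max(pi_1, pi_0)                         # aleatoric uncertainty, Eq. (22)
    return u_e, u_a

print(eau(1, 1))      # few examples: epistemic uncertainty dominates
print(eau(100, 100))  # many balanced examples: mostly aleatoric
print(eau(100, 1))    # clear majority: both uncertainties are low
```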

4 Discussion and comparison of the approaches

4.1 EAU versus EBU

Although the concepts of “conflicting evidence” and “insufficient evidence” of Sharma and Bilgic (2017) appear to be quite related, respectively, to aleatoric and epistemic uncertainty, the correspondence becomes much less obvious (and in fact largely disappears) upon a closer inspection. Besides, a direct comparison is complicated due to various technical issues with the evidence-based approach to uncertainty sampling. In particular, due to the preselection of the top-t uncertain instances (the set \({\mathbf {S}}\)), evidence-based uncertainty sampling is actually a variant of standard (entropy-based) uncertainty sampling, and completely degenerates to the latter for \(t=1\). As we are more interested in alternative measures of uncertainty, we will subsequently ignore the preselection step, and instead focus our discussion on the nature of the evidence measures themselves. In other words, we consider a version of evidence-based uncertainty sampling with a very large t. Before proceeding, let us emphasize that this is not the version proposed by the authors. Therefore, our discussion should be taken with a grain of salt.

As a first important observation, note that the evidences \(E_0(\varvec{x})\) and \(E_1(\varvec{x})\) solely depend on the relation of the class-conditional probabilities \(p_\theta (x^m \, \vert \, 1)\) and \(p_\theta (x^m \, \vert \, 0)\), which hides the number of training examples they have been estimated from, and hence their confidence. The latter, however, has an important influence on whether something is qualified as aleatorically or epistemically uncertain. As an illustration, consider a simple example with two binary attributes, the first with domain \(\{ a_1 , a_2 \}\) and the second with domain \(\{ b_1 , b_2 \}\). Denote by \(n_{i,j} = (n_{i,j}^+ , n_{i,j}^-)\) the number of positive and negative examples observed for \((x^1, x^2)=(a_i, b_j)\). Here are two scenarios:

$$\begin{aligned} \begin{array}{c|cc} & b_1 & b_2 \\ \hline a_1 & (1,1) & (1,1) \\ a_2 & (1,1) & (1,1) \end{array} \qquad \qquad \begin{array}{c|cc} & b_1 & b_2 \\ \hline a_1 & (100,100) & (100,100) \\ a_2 & (100,100) & (100,100) \end{array} \end{aligned}$$

In both scenarios, the insufficient-evidence uncertainty would be high, because all class-conditional probabilities are equal. In EAU, however, the first scenario would largely be a case of epistemic uncertainty, due to the small number of training examples, whereas the second would be aleatoric, because the equal posteriors are sufficiently “confirmed”. Similar remarks apply to conflicting evidence. In the scenario

$$\begin{aligned} \begin{array}{c|cc} & b_1 & b_2 \\ \hline a_1 & (1,1) & (10,1) \\ a_2 & (1,10) & (1,1) \end{array} \end{aligned}$$
the conflicting-evidence uncertainty would be high for \((a_1,b_1)\), because \(p_\theta (a_1 \, \vert \, 1) \gg p_\theta (a_1 \, \vert \, 0)\) and \(p_\theta (b_1 \, \vert \, 0) \gg p_\theta (b_1 \, \vert \, 1)\). The same holds for \((a_2, b_2)\), whereas the uncertainties for \((a_1,b_2)\) and \((a_2,b_1)\) would be low. Note, however, that in all these cases, exactly the same conditional probability estimates \(p_\theta (x^m \, \vert \, 1)\) and \(p_\theta (x^m \, \vert \, 0)\) are involved.

We would argue that epistemic uncertainty should directly refer to these probabilities, because they constitute the parameter \(\theta\) of the model. Thus, to reduce epistemic uncertainty (about the right model \(\theta\)), one should look for those examples that will mostly improve the estimation of these probabilities. Aleatoric uncertainty may occur in cases of posteriors close to 1/2, in which the conflicting evidence may indeed be high (although, as already mentioned, the latter ignores the class priors). Yet, we would not necessarily call such cases a “conflict”, because the predictions are completely in agreement with the underlying model (Naïve Bayes), which assumes class-conditional independence of attributes, i.e., an independent combination of evidences on different attributes.

Fig. 1  Two scenarios for logistic regression: training data with positive (red crosses) and negative examples (black circles) and five query instances

Another illustration is provided in Fig. 1, now for the case of logistic regression. A first important observation is that the uncertainties due to conflicting and insufficient evidence are exactly the same in both scenarios in Fig. 1, the left and the right one. This is because these uncertainties are merely derived from the single model \(h_{\theta }\) learned from the training data. Thus, like in the example for Naïve Bayes, the evidence-based approach does not capture model uncertainty, i.e., uncertainty about the truly optimal model (which is clearly larger on the left and smaller on the right), which EAU essentially measures in terms of epistemic uncertainty.

For the first three queries, the evidence-based uncertainties are very different: The first query has a high insufficient-evidence uncertainty, the third has a high conflicting-evidence uncertainty, and the second none of the two. According to EAU, the uncertainties for these three cases are all high (because they are all located close to the decision boundary) and, more importantly, of the same nature: mostly aleatoric in the right and a mix of aleatoric and epistemic in the left scenario. For the second, fourth and fifth query, the evidence-based uncertainties are roughly the same. Again, this is very different from EAU, which assigns a high uncertainty to the second but very low uncertainties to the fourth and fifth query.

Fig. 2  Two scenarios for logistic regression: training data with positive (red crosses) and negative examples (black circles) and five query instances

Figure 2 shows two very similar scenarios, but now with a bias term. As already said, the evidence-based approach does not account for such a bias. Moreover, since \(\theta _1 \approx 0\) (the first feature does not seem to have an influence), there is essentially no negative evidence, i.e., \(E_{0}(\varvec{x})\) is always close to 0. Consequently, the product \(E_{0}(\varvec{x}) \times E_{1}(\varvec{x})\) will be small, too, suggesting that conflicting-evidence uncertainty is always low and insufficient-evidence uncertainty always high.

As shown by these examples, the additional uncertainty captured by EBU is very different from aleatoric and epistemic uncertainty in EAU. In particular, the evidence-based approach can be criticized for ignoring model uncertainty as well as properties of the model class. Although the measures of evidence for the positive and negative class, such as (10), are derived from the model, the evidences are “feature-based” in the sense of considering the evidence provided by each feature in isolation. What is not taken into account, however, is the way in which the model combines the features into an overall prediction. In logistic regression, for example, the features are linearly combined into a single score, and the class probabilities are expressed as a function of this score. For instance, a model like the one we considered in our example,

$$\begin{aligned} p( y = 1 \, \vert \, \varvec{x}) = \frac{1}{1 + \exp ( - \gamma (x^2 - x^1))} \, , \end{aligned}$$

assumes that the probability of the positive class is a function of the difference between \(x^2\) and \(x^1\). One may wonder, therefore, why one should consider a case where both \(x^1\) and \(x^2\) are large as a conflict (and, likewise, a case with both values being small as not providing sufficient evidence for a prediction). From this point of view, the very idea of conflicting (and, likewise, insufficient) evidence may appear somewhat questionable.

4.2 EAU versus CU

Credal uncertainty (sampling) seems to be closer to EAU, at least in terms of the underlying principle. In both approaches, model uncertainty is captured in terms of a set of plausible candidate models from the underlying hypothesis space, and this (epistemic) uncertainty about the right model is translated into uncertainty about the prediction for a given \(\varvec{x}\). In credal uncertainty sampling, the candidate set is given by the credal set C, which corresponds to the distribution \(\pi _{\Theta }\) in EAU. As a difference, we note that the latter is a “graded set”, to which a candidate \(\theta\) belongs with a certain degree of membership (the relative likelihood), whereas a credal set is a standard set, in which a model is either included or not. Using machine learning terminology, C plays the role of a version space (Mitchell 1977), whereas \(\pi _{\Theta }\) represents a kind of generalized (graded) version space (Hüllermeier 2003).

More specifically, the wider the interval \([{\underline{p}}(1 \, \vert \, \varvec{x}), {\overline{p}}(1 \, \vert \, \varvec{x})]\) in (17), the larger the score \(s(\varvec{x})\), with the maximum being obtained for the case [0, 1] of complete ignorance. This is well in agreement with the degree of epistemic uncertainty in EAU. In the limit, when \([{\underline{p}}(1 \, \vert \, \varvec{x}), {\overline{p}}(1 \, \vert \, \varvec{x})]\) reduces to a precise probability \(p(1 \, \vert \, \varvec{x})\), i.e., the epistemic uncertainty disappears, (17) is maximal for \(p(1 \, \vert \, \varvec{x}) = 1/2\) and minimal for \(p(1 \, \vert \, \varvec{x})\) close to 0 or 1. Again, this behavior is in agreement with the conception of aleatoric uncertainty in EAU. More generally, comparing two intervals of the same length, (17) will be larger for the one that is closer to the middle point 1/2. Thus, it seems that the credal uncertainty score (17) combines both epistemic and aleatoric uncertainty in a single measure.

Fig. 3  From left to right: exponential rescaling of the credal uncertainty measure (17), epistemic uncertainty \(u_e\) and aleatoric uncertainty \(u_a\) for intervals \([{\underline{p}}, {\overline{p}}]\) with lower probability \({\underline{p}}\) (x-axis) and upper probability \({\overline{p}}\) (y-axis). Lighter colors indicate higher values

Yet, upon closer examination, its similarity to epistemic uncertainty is much higher than the similarity to aleatoric uncertainty. Note that, for EAU, the special case of a credal set C can be imitated with the measure \(\pi _{\Theta }(\theta ) = 1\) if \(\theta \in C\) and \(\pi _{\Theta }(\theta ) = 0\) if \(\theta \not \in C\). Then, (19) and (20) become

$$\begin{aligned} \pi (1 \, \vert \, \varvec{x})&= \sup _{\theta \in C} \max [ \, 2 \, p_\theta (1 \, \vert \, \varvec{x}) - 1 , 0 \, ] = \max [ \, 2 \, {\overline{p}}(1 \, \vert \, \varvec{x}) - 1 , 0 \, ] \, ,\\ \pi (0 \, \vert \, \varvec{x})&= \sup _{\theta \in C} \max [ \, 2 \, p_\theta (0 \, \vert \, \varvec{x}) - 1 , 0 \, ] = \max [ \, 1 - 2 \, {\underline{p}}(1 \, \vert \, \varvec{x}) , 0 \, ] \, , \end{aligned}$$

and \(u_e\) and \(u_a\) can be derived from these values as before. Figure 3 shows a graphical illustration of the credal uncertainty score (17) as a function of the probability bounds \({\underline{p}}\) and \({\overline{p}}\), and the same illustration is given for epistemic uncertainty \(u_e\) and aleatoric uncertainty \(u_a\). From the visual impression, it is clear that the credal uncertainty score closely resembles \(u_e\), while behaving quite differently from \(u_a\). This impression is corroborated by a simple correlation analysis, in which we ranked the intervals

$$\begin{aligned}{}[{\underline{p}}, {\overline{p}}] \in \left\{ \, I_{a,b} = \left[ \frac{a}{100}, \frac{b}{100} \right] \, \Big \vert \, a,b \in \{0,1, \ldots , 100\}, \, a \le b \, \right\} \, , \end{aligned}$$

i.e., a quantization of the class of all probability intervals, according to the different measures, and then computed the Kendall rank correlation. While the ranking according to (17) is strongly correlated with the ranking for \(u_e\) (Kendall's \(\tau\) is around 0.86), it is almost uncorrelated with \(u_a\).
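This analysis is straightforward to reproduce. The sketch below ranks the quantized intervals by the credal score (17) and by the credal-set versions of \(u_e\) and \(u_a\) derived above, and computes Kendall's \(\tau\) with scipy; the small guards at the interval endpoints are our own addition.

```python
from scipy.stats import kendalltau

s_cu, u_e, u_a = [], [], []
for a in range(101):
    for b in range(a, 101):
        lo, up = a / 100.0, b / 100.0
        # credal score, Eq. (17), with small guards at the endpoints
        g10 = lo / max(1 - lo, 1e-9)
        g01 = (1 - up) / max(up, 1e-9)
        s_cu.append(-max(g10, g01))
        # credal-set instantiation of the EAU degrees (see above)
        pi1 = max(2 * up - 1, 0.0)
        pi0 = max(1 - 2 * lo, 0.0)
        u_e.append(min(pi1, pi0))
        u_a.append(1 - max(pi1, pi0))

tau_e, _ = kendalltau(s_cu, u_e)
tau_a, _ = kendalltau(s_cu, u_a)
print(tau_e)  # strongly positive (around 0.86)
print(tau_a)  # close to zero
```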

In summary, the credal uncertainty score appears to be quite similar to the measure of epistemic uncertainty in EAU. As potential advantages of the latter, let us mention the following points. First, the degree of epistemic uncertainty is normalized and bounded, and thus easier to interpret. Second, it is complemented by a degree of aleatoric uncertainty—the two degrees are carefully distinguished and have a clear semantics. Third, handling candidate models in a graded manner, and modulating their influence according to their plausibility, appears to be more reasonable than creating an artificial separation into plausible and non-plausible models (i.e., the credal set and its complement).

5 Experiments

This section starts with a description of the experimental setting and the data sets used in the experiments. Some technical details, e.g., regarding the choice of the model parameters and instantiations of aleatoric and epistemic uncertainty, are deferred to Sects. A.1 and A.2 in the appendix. Finally, the results of the experiments are presented and analyzed.

5.1 Data sets and experimental setting

We perform experiments on binary classification data sets from the UCI repository, the properties of which are summarized in Table 1. To make sure that the data is amenable to all methods without the need for further preprocessing, we only selected data with numerical features. Each data set is randomly split into \(10\%\) training, \(80\%\) pool, and \(10\%\) test data. The training data is used to obtain an initial model. Then, in each iteration, the learner is allowed to evaluate the instances from the pool and query a (mini-)batch of these instances — according to the strategy of uncertainty sampling, the learner selects those instances with the highest degrees of uncertainty. The chosen instances are labelled (by an oracle or expert) and added to the training data \({\mathbf {D}}\), on which the model is then re-trained. The budget of the active learner is fixed to the size of the pool, and the performance of the classifiers is monitored over the entire active learning process. The whole procedure is repeated 1000 times and test accuracies are averaged. The following variants of uncertainty sampling are included in the experimental studies:

  • Rand: Random sampling

  • ENT: Standard uncertainty sampling based on the entropy measure (1)

  • CEU: Conflicting-evidence uncertainty sampling (8)

  • IEU: Insufficient-evidence uncertainty sampling (9)

  • CU: Credal uncertainty sampling (17)

  • EU: Epistemic uncertainty sampling (21)

  • AU: Aleatoric uncertainty sampling (22)

Table 1 Data sets used in the experiments

5.1.1 Local learning

By local learning, we refer to a class of non-parametric models that derive predictions from the training information in a local region of the instance space, for example the local neighborhood of a query instance (Bottou and Vapnik 1992; Cover and Hart 1967). As a simple example, we consider the Parzen window classifier (Chapelle 2005), to which most of the mentioned approaches can be applied in a quite straightforward way. For a given instance \(\varvec{x}\), we define the set of its neighbours as follows:

$$\begin{aligned} R(\varvec{x},\epsilon ) = \big \{ (\varvec{x}_i , y_i) \in {\mathbf {D}} \, \vert \, \Vert \varvec{x}_i - \varvec{x} \Vert \le \epsilon \big \} \, , \end{aligned}$$

where \(\epsilon\) is the width of the Parzen window. In binary classification, a local region \(R(\varvec{x},\epsilon )\) can be associated with a constant hypothesis \(h_\theta\), \(\theta \in \Theta = [0,1]\), where \(p_{\theta }(1 | \varvec{x}) = h_\theta (\varvec{x}) \equiv \theta\). With p and n the number of positive and negative instances, respectively, within a Parzen window \(R(\varvec{x}, \epsilon )\), the likelihood function and the maximum likelihood estimate are, respectively, given by

$$\begin{aligned} L(\theta )= \left( \begin{array}{c} p+n\\ p \\ \end{array} \right) \theta ^p (1-\theta )^n \, , \text { and } {\hat{\theta }} = \frac{p}{p+n} \, . \end{aligned}$$
(25)

Since the likelihood function is well-defined, we can determine the degrees of epistemic and aleatoric uncertainty as described in Sect. 3.3; we refer to Sect. A.1 for the technical details.

How to determine the width \(\epsilon\) of the Parzen window? This value is difficult to assess, and an appropriate choice strongly depends on properties of the data and the dimensionality of the instance space. Intuitively, it is even difficult to say in which range this value should lie. Therefore, instead of fixing \(\epsilon\), we fixed an absolute number K of neighbors in the training data, which is intuitively more meaningful and easier to interpret. A corresponding value of \(\epsilon\) is then determined in such a way that the average number of nearest neighbours of instances \(\varvec{x}_i\) in the training data \({\mathbf {D}}\) is just K. In other words, \(\epsilon\) is determined indirectly via K. Furthermore, since we are not, in the first place, interested in maximizing performance, but in analyzing the effectiveness of active learning approaches, we simply fix the neighborhood size K as the square root of the size of the data set (number of instances in the initial training and pool set) as suggested by Lall and Sharma (1996). A practical algorithm for determining \(\epsilon\) given K, and the way in which we handle empty Parzen windows, are also given in Sect. A.1.
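For illustration, one simple way to determine \(\epsilon\) from K is bisection on the average neighbourhood size; this sketch is our own and not necessarily the algorithm given in Sect. A.1.

```python
import numpy as np

def epsilon_from_k(X, K):
    """Bisection for the Parzen width: choose epsilon such that the average
    number of training neighbours within distance epsilon is roughly K.
    X: array of shape (N, d) holding the training instances."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    lo, hi = 0.0, float(dist.max())
    for _ in range(50):
        eps = (lo + hi) / 2
        # average neighbourhood size, excluding the point itself
        avg = (dist <= eps).sum(axis=1).mean() - 1.0
        if avg < K:
            lo = eps
        else:
            hi = eps
    return (lo + hi) / 2

X = np.random.rand(200, 2)
print(epsilon_from_k(X, K=int(np.sqrt(200))))
```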

In a similar way, the approach can be applied to decision tree learning (Quinlan 1986; Safavian and Landgrebe 1991). In fact, recall that a decision tree partitions the instance space \({\mathcal {X}}\) into (rectangular) regions \(R_1, \ldots , R_L\) (i.e., \(\bigcup _{i=1}^L R_i = {\mathcal {X}}\) and \(R_i \cap R_j = \emptyset\) for \(i \ne j\)) associated with corresponding leaves of the tree (each leaf defines a region R). Again, in the case of binary classification, we can assume each region R to be associated with a constant hypothesis \(h_\theta\), \(\theta \in \Theta = [0,1]\), where \(h_\theta (\varvec{x}) \equiv \theta\) is the probability of the positive class. Therefore, degrees of epistemic and aleatoric uncertainty can be derived in the same way as described for the Parzen window.

For the Parzen window classifier and decision trees, we fixed the batch size to \(1\%\) of the initial pool dataset. For the approach based on credal uncertainty (CU), we determine the lower and upper probabilities based on the number of positive and negative examples in a region, following the procedure described in Appendices 1 and 2 of Antonucci and Cuzzolin (2010). Note that the evidence-based approach (Sect. 3.1) is not immediately applicable to these learners, and is therefore omitted from the experiments.

5.1.2 Logistic regression

In contrast to nonparametric, local learning methods such as the Parzen window classifier, logistic regression is a parametric class of linear models, and hence comes with comparatively restrictive assumptions. Recall that logistic regression assumes posterior probabilities to depend on feature vectors \(\varvec{x} = (x^1,\ldots , x^d) \in {\mathbb {R}}^d\) in the following way:

$$\begin{aligned} h(\varvec{x}) = p(1 \, \vert \, \varvec{x})= \frac{\exp \left( \theta _0 + \sum _{i= 1}^d \theta _i \, x^i \right) }{1 + \exp \left( \theta _0 + \sum _{i = 1}^d \theta _i \, x^i \right) } \, . \end{aligned}$$

This means that learning the model comes down to estimating a parameter vector \(\theta =(\theta _0, \ldots , \theta _d)\), which is commonly done through likelihood maximization (Menard 2002). For numerical stability, we employ \(L_2\)-regularization, which comes down to maximizing the following strictly concave function (Rennie 2005):

$$\begin{aligned} l(\theta ) = \log L(\theta ) =&\sum _{n=1}^N y_n \left( \theta _0 + \sum _{i=1}^d \theta _i x_n^i \right) \\&- \sum _{n=1}^N \ln \left( 1+ \exp \left( \theta _0 + \sum _{i=1}^d \theta _i x_n^i \right) \right) - \frac{\gamma }{2}\sum _{i=1}^d \theta _i^2 \, , \end{aligned}$$
(26)

where the regularization parameter \(\gamma\) is fixed to 1. On the basis of this likelihood function, the degrees of epistemic and aleatoric uncertainty can again be determined as described in Sect. 3.3; as before, technical details and a practical algorithm are deferred to Sect. A.2 in the appendix.
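For concreteness, the objective (26) can be written down and maximized with any off-the-shelf convex optimizer; the following sketch, with a numerically stable computation of \(\ln (1 + \exp (z))\) via logaddexp, is illustrative and not the exact implementation used in our experiments (the data in the usage example is synthetic).

```python
import numpy as np
from scipy.optimize import minimize

def log_likelihood(theta, X, y, gamma=1.0):
    """Regularized log-likelihood l(theta) of Eq. (26); theta[0] is the
    intercept theta_0, theta[1:] are the (regularized) coefficients."""
    z = theta[0] + X @ theta[1:]
    return (y @ z
            - np.sum(np.logaddexp(0.0, z))  # stable log(1 + exp(z))
            - 0.5 * gamma * np.sum(theta[1:] ** 2))

# maximizing l amounts to minimizing -l with any convex optimizer:
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
res = minimize(lambda t: -log_likelihood(t, X, y), x0=np.zeros(4))
print(res.x)  # fitted (theta_0, theta_1, theta_2, theta_3)
```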

For the case of logistic regression, the evidence-based approach can be applied as well (cf. Sect. 3.1.3). Following Sharma and Bilgic (2017), we set the number of top uncertain instances to be evaluated to 5 times the batch size.

5.2 Results

As can be seen in Fig. 4, in the case of the Parzen window classifier, EU performs best and AU worst. Moreover, standard uncertainty sampling (ENT) and random sampling are in-between the two. This is in agreement with our expectations and supports our conjecture that, from an active learning point of view, epistemic uncertainty is the more useful information. Even if the improvements compared to ENT are not huge, they are still visible and quite consistent. The performance of CU is competitive with that of EU, and again in agreement with our expectations — as discussed in Sect. 4.2, both CU and EU have the ability to capture model uncertainty. The results for decision tree learning (cf. Fig. 5) are quite similar. Now, however, standard uncertainty sampling based on entropy performs worse, and the advantage of epistemic uncertainty sampling is even more pronounced.

Fig. 4  Average accuracies (y-axis) for the Parzen window classifiers as a function of the number of instances queried from the pool (x-axis)

Fig. 5  Average accuracies (y-axis) for the decision trees as a function of the number of instances queried from the pool (x-axis)

In the case of logistic regression (cf. Fig. 6), the picture looks a bit different. Here, epistemic, aleatoric, and standard uncertainty sampling perform more or less the same (and all significantly better than random sampling), whereas no clear pattern emerges for the evidence-based uncertainty measures. As a plausible explanation, note that, in contrast to the local learning methods in the first experiment, logistic regression comes with a very strong learning bias in the form of a linearity assumption. Therefore, the epistemic (or model) uncertainty disappears quite quickly: The linear decision boundary stabilizes relatively early in the learning process, and then, the learner is rather “certain” about its predictions (regardless of whether this certainty is warranted or not). According to the logistic model, the uncertain cases are those closest to the current decision boundary, some of them with a slightly higher epistemic and others with a higher aleatoric uncertainty. In any case, all three methods, EU, AU, and ENT, sample near the decision boundary. Thus, it is hardly surprising that they show similar performance.

Fig. 6  Average accuracies (y-axis) for logistic regression as a function of the number of instances queried from the pool (x-axis)

Overall, the experiments nevertheless confirm that, in the context of uncertainty sampling for active learning, epistemic uncertainty is a viable alternative to standard uncertainty measures like entropy: For local learning methods, in which epistemic uncertainty tends to be higher, epistemic uncertainty sampling improves upon standard uncertainty sampling, and for global methods with a strong learning bias, it performs at least on a par. Credal uncertainty, which behaves similarly to epistemic uncertainty (cf. Sect. 4.2), shows strong performance as well.

As an aside, note that the learning curves are not all monotonically increasing, which might be surprising at first sight. Actually, however, this kind of behavior is not uncommon and may occur if a data set, in addition to useful examples, also comprises low-quality (e.g., noisy or otherwise misleading) instances. In this case, a strong active learning strategy may succeed in selecting the informative, high-quality examples first, leading to a good model with strong predictive performance. In the end, if the pool needs to be exhausted, the active learner is “forced” to pick the low-quality examples, too, thereby causing a drop in performance.

5.3 Influence of model bias

The results presented above suggest that epistemic uncertainty might be more advantageous for (active) learners with a low bias and less so for learners with a strong bias. To corroborate this conjecture, we conducted an additional experiment, using decision trees with a maximum depth limit as model classes. This allows for controlling the bias in a seamless manner: The higher the depth limit, the less restricted the model class, and hence the lower the bias.

Figure 7 shows the learning curves for the depth limits \(\{2, 3, 5, 10\}\) on the data sets blood and QSAR. As expected, different depth limits appear to be optimal for different problems (the best limit for blood is 3, for QSAR 10). However, more interesting for our purpose is the slope of the learning curves, which indeed seems to support our conjecture: For epistemic uncertainty, the learning curves increase faster for higher and slower for lower depth limits — for aleatoric uncertainty, it is just the other way around. To make this even clearer, Fig. 8 plots the relative performance in comparison to standard (entropy-based) uncertainty sampling, i.e., the performance ratio, both for epistemic and aleatoric uncertainty. As can be seen, EU tends to be superior, because the ratio is mostly larger than 1, while the opposite holds for AU. More importantly, the depth limit (bias) is in perfect agreement with the “order” of the curves: The higher the limit, the better EU (and the worse AU) in terms of relative performance.

Fig. 7  Average accuracies (y-axis) for decision trees with different depth limits as a function of the number of instances queried from the pool (x-axis)

Fig. 8  Performance relative to standard uncertainty sampling (ratio of average accuracies, y-axis) for decision trees with different depth limits as a function of the number of instances queried from the pool (x-axis)

5.4 Uncertainty as stopping criterion

In a last experiment, we analyze the potential of epistemic uncertainty to serve as a stopping criterion for an active learning process. Indeed, this appears to be a rather natural idea, because epistemic uncertainty — as opposed to aleatoric or total uncertainty — reflects the state of knowledge of the learner, and the potential to improve this knowledge through additional data. If the epistemic uncertainty is low for all instances remaining in the pool, this suggests that almost nothing can be gained anymore through additional sampling.

The criterion just outlined is an instance of the third type of stopping criteria commonly used in active learning (Li and Sethi 2006; Zhu et al. 2010): The active learning process ends if

  • the training data set reaches a desired size;

  • a targeted performance level is achieved;

  • no informative examples are available anymore.

While it is difficult to pre-define either a desirable size of the training data set or a targeted performance level, the last criterion can be easily implemented by setting some predefined uncertainty threshold and stopping the active learning process if the degree of uncertainty falls below the threshold (Zhu et al. 2010).
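A minimal sketch of such a stopping rule based on epistemic uncertainty is given below; the threshold value is purely illustrative, and u_epistemic stands for any instantiation of (21).

```python
def should_stop(pool, u_epistemic, threshold=0.05):
    """Stop once no pool instance has epistemic uncertainty, Eq. (21),
    above the threshold; the threshold value is purely illustrative."""
    return max(u_epistemic(x) for x in pool) < threshold
```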

The potential usefulness of epistemic uncertainty is confirmed by our results (shown in Fig. 9 for two data sets; results for the other data sets are similar and can be found in Sect. 2 in the appendix).

Fig. 9  Average accuracies and degrees of epistemic uncertainty (mean and maximum over instances in the pool, y-axis) as functions of the number of instances queried from the pool (x-axis)

6 Conclusion

This paper reconsiders the principle of uncertainty sampling in active learning from the perspective of uncertainty modeling and quantification. More specifically, it starts from the supposition that, when it comes to the question of which instances to select from a pool of candidates, a learner’s predictive uncertainty due to “not knowing” should be more relevant than its uncertainty due to confirmed randomness.

To corroborate this conjecture, we revisited recent approaches to uncertainty quantification in machine learning, with a specific emphasis on methods that allow for separating different types of uncertainty, and incorporated them in the general uncertainty sampling procedure. Following a comparison and critical discussion of these approaches, a series of experiments with different learning algorithms was conducted. In these experiments, a distinction between so-called epistemic and aleatoric uncertainty proved to be especially useful. More specifically, epistemic uncertainty sampling, in the sense of uncertainty sampling based on measures of epistemic uncertainty in a prediction, shows strong performance and consistently improves on standard uncertainty sampling. These results, which we interpret as clear evidence in favor of our conjecture, are indeed quite plausible: Epistemic and aleatoric uncertainty can be thought of, respectively, as the reducible and irreducible part of the total uncertainty. Consequently, querying an instance with a high epistemic uncertainty may provide useful information for the learner, whereas an aleatorically uncertain instance is unlikely to do so.

Given this affirmation, we are now encouraged to elaborate on epistemic uncertainty sampling in more depth, and to develop it further. In this regard, there are various directions to be followed:

  • Depending on the underlying model class, the quantification of epistemic uncertainty based on the generic approach by Senge et al. (2014) can be computationally expensive. Therefore, efficient instantiations for important learning methods would be desirable.

  • Similar approaches for measuring epistemic uncertainty, which have been proposed in the literature more recently, should be investigated as possible alternatives (Depeweg et al. 2018).

  • Our experimental results suggest that a distinction between epistemic and aleatoric uncertainty is more useful for learners with a weak inductive bias and less useful for learners with a strong bias. This observation ought to be analyzed in more detail and corroborated by further experiments. In fact, the learning algorithms included in our study constitute extremes on this spectrum (the Parzen classifier and decision trees have a very low bias, logistic regression a very strong one), and additional experiments with learners having a “moderate” bias would certainly be useful.

  • Quite interestingly, the very notion of epistemic uncertainty seems to share many commonalities with other principles that have been suggested for active learning — probably not by chance. One example is so-called expected model change or the related principle of expected model output change (EMOC), where the idea is to query instances that, if added to the training data, are likely to cause large changes of the hypothesis or the predictions produced by the hypothesis (Freytag et al. 2014). According to our quantification of epistemic uncertainty (but also other formalizations), such instances should also have a high epistemic uncertainty. Therefore, epistemic uncertainty sampling seems to have much in common with EMOC, perhaps with the notable difference that the former looks at the uncertainty for a single instance, whereas the latter considers the expected change over all instances. Nevertheless, elaborating on this connection more closely seems to be worthwhile. The same holds true for another well-established active learning strategy, namely the query-by-committee (QBC) approach (Seung et al. 1992). In fact, the diversity of the predictions of an ensemble of hypotheses, which is used as a selection criterion in QBC, has recently also been advocated as a suitable means for quantifying epistemic uncertainty (Shaker and Hüllermeier 2020).

  • It might be interesting to combine the quantification of uncertainty with the notion of relevance or representativeness in active learning (McCallum and Nigam 1998; Lindenbaum et al. 2004): How relevant is a certain improvement of the current model for the overall performance of the learner? In local learning algorithms, for example, epistemic uncertainty tends to be high in sparse regions of the instance space, so that an active learner is tempted to sample there. At the same time, however, such regions appear to be less important for the generalization performance, simply because future queries will more likely occur in dense regions. Overall, a small improvement in a dense region may thus be more beneficial than a big improvement in a sparse region. This observation motivates a kind of density weighting, i.e., the combination (multiplication) of an uncertainty degree with the (estimated) density of a data point (Krempl et al. 2015).

  • Last but not least, going beyond uncertainty sampling for binary classification as considered in this paper, the idea of epistemic uncertainty sampling should also be extended toward other learning problems, such as multi-class classification and regression.