Toward optimal probabilistic active learning using a Bayesian approach

Gathering labeled data to train well-performing machine learning models is one of the critical challenges in many applications. Active learning aims at reducing the labeling costs by an efficient and effective allocation of costly labeling resources. In this article, we propose a decision-theoretic selection strategy that (1) directly optimizes the gain in misclassification error, and (2) uses a Bayesian approach by introducing a conjugate prior distribution to determine the class posterior to deal with uncertainties. By reformulating existing selection strategies within our proposed model, we can explain which aspects are not covered in current state-of-the-art and why this leads to the superior performance of our approach. Extensive experiments on a large variety of datasets and different kernels validate our claims.


Introduction
To train classifiers with machine learning algorithms in a supervised manner, we need labeled data. Whereas gathering unlabeled instances is easy, the annotation with class labels is often expensive, exhausting, or time-consuming and therefore needs to be optimized. Active learning (AL) algorithms aim to reduce annotation costs efficiently and effectively (Settles, 2009). For that purpose, a selection strategy successively chooses the most useful labeling candidate from the pool of unlabeled instances and acquires the corresponding label from an oracle.
Our approach builds on three pillars: (1) We approximate the usefulness of one candidate on a representative subset, as mentioned in "toward optimal AL" by Roy & McCallum (2001). (2) We estimate the usefulness by determining the decision-theoretic gain in performance, as mentioned in "probabilistic AL" by Kottke et al. (2016). (3) We use a Bayesian approach and introduce a conjugate prior distribution to calculate the predictive posterior distribution.
Thereby, we consider the certainty of a classifier on its predictions (Murphy, 2006). As indicated in italic font, these pillars explain our choice of the title of this article.
The contributions of this article are as follows: • We propose a universal model for decision-theoretic AL, called xPAL, which calculates the gain in performance using a Bayesian approach.
• By simplifying our model, we prove equivalence to existing AL methods and show how this simplification affects the selection of candidates.
• Our experiments on 22 datasets confirm the superiority of our approach compared to several baselines and the robustness of our prior parameter.
The remainder of this article is structured as follows: First, we discuss related work in Sec. 2. In Sec. 3, we define our problem and provide the foundations for our model. In Sec. 4, we propose our new method xPAL and show how it theoretically and empirically relates to state-of-the-art approaches in Sec. 5. We evaluate our results experimentally and discuss our key findings in Sec. 6. We close this article with a conclusion and an outlook on our future work in that field.

Related Work
The central component of an AL algorithm is the selection strategy. The most naïve one is to choose the next candidate randomly (Settles, 2009). A common heuristic is uncertainty sampling (Lewis & Gale, 1994). The idea is to use, e. g., the estimated class posteriors of probabilistic classifiers or the distance to the decision boundary to build a usefulness score (Settles, 2012). This exploits the current classification hypothesis by labeling instances close to the decision boundary. In contrast to density-based approaches (Nguyen & Smeulders, 2004), it ignores the representativeness of selected instances for the entire training set, and fails to perform exploration (Bondu et al., 2010; Osugi et al., 2005). That is, it does not search the instance space for large regions with incorrect classifications. This might lead to even worse performance compared to random sampling (Settles, 2012). Hence, there exist variants that add random sampling (Žliobaitė et al., 2014; Thrun & Möller, 1992), use reinforcement learning (Osugi et al., 2005) or simulated annealing (Zoller & Buhmann, 2000) to balance exploitation and exploration, or combine it with a density weight (Donmez et al., 2007) and a variety of further factors, including sample diversity (Weigl et al., 2015; Xu et al., 2007; Brinker, 2003) and class priors (Calma et al., 2018).
Uncertainty sampling is a special case of adaptive submodular maximization (Cuong et al., 2014), and several works have established links between submodularity and AL (Cuong et al., 2014; Golovin & Krause, 2010; Guillory & Bilmes, 2010). An example of a recent approach built on these works is filtered active submodular selection (FASS) (Wei et al., 2015). FASS combines uncertainty sampling with a submodular data subset selection framework, capturing both sample informativeness and representativeness. For Gaussian Process classifiers, a Bayesian information-theoretic AL approach is Bayesian Active Learning by Disagreement (BALD) (Houlsby et al., 2011). BALD aims to select instances with the highest marginal uncertainty about the class label but simultaneously high confidence for the individual settings of the model's parameters.
The query by committee (QBC) method (Seung et al., 1992) builds classifier ensembles and aims to reduce the disagreement between them. To improve balancing of exploration and exploitation in ensembles of active learners, Baram et al. (2004) proposed a formulation as a multi-armed bandit problem. Here, each active learner corresponds to one slot machine whose relative progress in performance is tracked over time, and on each trial one active learner is chosen for selecting an instance using the EXP4 algorithm. Furthermore, reinforcement learning approaches have been proposed that learn a policy for selecting active learners, for example by modelling active learning as a Markov decision process (Konyushkova et al., 2018).
In 2001, Roy & McCallum (2001) proposed expected error reduction. As shortly addressed in the introduction, they aim to estimate the expected generalization error if a candidate gets an additional label. Thus, they simulate each label for each labeling candidate and evaluate the mean error using the unlabeled instances. To estimate the probabilities, they use the class posteriors provided by probabilistic classifiers. Chapelle (2005) noticed that these estimates are highly unreliable (esp. at the beginning of the training) and therefore suggested the use of a beta prior. Kottke et al. (2016) address the issue pointed out by Chapelle and named their approach probabilistic AL. They propose to use a distribution of the class posterior probability instead of using the classifier outputs directly. Calculating the expectation over this posterior yields a decision-theoretic approach that eliminates Chapelle's parameter and is mathematically sound.

Problem Formulation and Foundations
In "The Nature of Statistical Learning Theory," Vapnik (1995) introduced a holistic concept of how to learn from examples. He defined three components that take part in such a process, namely a generator, a supervisor, and a learning machine.¹ The generator creates random vectors x ∈ ℝ^D (D-dimensional feature space) independently drawn from a fixed but unknown probability distribution p(x). The supervisor provides class labels y ∈ Y = {1, . . . , C} (C is the number of classes) for every instance x according to a conditional distribution p(y|x) which is also fixed but unknown. In our case, the learning machine is a classifier f_θ(x) with some parameters θ. The goal is to choose the learning machine that approximates the supervisor's response best.
We adopt the above definition for the active learning scenario by refining the role of the (omniscient) supervisor:

Definition 1 (Supervisor) A supervisor consists of:

1. A ground truth, which is an unknown but fixed, deterministic function t : ℝ^D → [0, 1]^C that maps an instance x to a probability vector p = t(x) with ∑_{i=1}^{C} p_i = 1. Each element describes the true probability for the corresponding class given the instance x.
2. An oracle which provides a class label y ∈ Y for every instance x according to the ground truth p = t(x). Hence, the label is sampled from a categorical distribution y ∼ Cat(t(x)).
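A minimal sketch of this supervisor model, with a hypothetical ground truth t (the logistic form of t and the feature it uses are illustrative assumptions, not from the article):

```python
import numpy as np

rng = np.random.default_rng(0)

def ground_truth(x):
    """Hypothetical ground truth t: maps an instance x to a class-probability
    vector p. (Illustrative stand-in; the true t is unknown in practice.)"""
    p1 = 1.0 / (1.0 + np.exp(-x[0]))   # class-1 probability from first feature
    return np.array([1.0 - p1, p1])

def oracle(x):
    """Sample a label y ~ Cat(t(x)), as in Definition 1."""
    p = ground_truth(x)
    return rng.choice(len(p), p=p)

x = np.array([0.0, 1.0])
y = oracle(x)   # a label drawn according to the ground truth at x
```

Repeated queries for similar instances yield different labels in the proportions given by t(x), which is exactly the data-driven view discussed below.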
We visualize the learning process in Fig. 1. The generator provides instances x for which the oracle provides the class label y based on the ground truth t(x) = p = (p_1, . . . , p_C). Unfortunately, we solely have information about the instance-label-pair (x, y) but not about the generator, the ground truth, or the oracle.

¹ We adapt the terms and notation slightly. We use calligraphy for sets, bold font for vectors, and p(·) is either the probability density function or the probability mass of a discrete probability space. Please note the difference between p and p(·) (the latter is always a function).
In the technical community, the process of data generation is often described from a model-driven perspective: Then it is assumed that each class y has its own data generator p(x|y). Hence, every instance x has exactly one label, which is also called ground truth. Due to noise during data generation, different classes might appear in the same region, but still, the true label exists. Our view (as given in Def. 1 and Fig. 1) is purely data-driven: Looking at the data, we do not know why there are different labels in the same region. It could be due to noise in the data generation or due to the imperfectness of the oracle. When learning a classifier, the reason does not matter: We only observe that the oracle provides different labels for similar instances according to some proportion p which we call ground truth.
In the field of active learning, we assume to have an unlabeled dataset U = {x 1 , . . . , x N } (candidate pool) given by the generator. Labels are usually not available at the beginning but can be acquired from the oracle (Settles, 2009), which chooses the label according to the ground truth.
A selection strategy selects an instance x ∈ U, and we acquire the corresponding label y ∈ Y from the oracle. We remove the newly labeled instance from the candidate pool U ← U \ {x}, add the instance-label-pair to the labeled set L ← L ∪ {(x, y)}, and retrain the classifier on L.
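The acquisition cycle described above can be sketched as a generic pool-based loop; `select`, `oracle`, and `fit` are placeholder callables for a selection strategy, a label source, and a training routine:

```python
def active_learning_loop(U, select, oracle, fit, budget):
    """Generic pool-based AL loop: select a candidate, query the oracle,
    move the instance from the unlabeled pool U to the labeled set L,
    and retrain the classifier."""
    L = []
    clf = fit(L)
    for _ in range(min(budget, len(U))):
        x = select(U, L, clf)              # pick the most useful candidate
        y = oracle(x)                      # acquire its label
        U = [u for u in U if u is not x]   # U <- U \ {x}
        L = L + [(x, y)]                   # L <- L ∪ {(x, y)}
        clf = fit(L)                       # retrain on L
    return L, clf
```

Every strategy discussed in this article only changes `select`; the rest of the loop stays identical.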
We use a kernel-based classifier with kernel K which describes the similarity of two instances x and x . In our experiments, we use three different kernels (see Sec. 6) but our method is not restricted to these kernels.
Definition 2 (Kernel Frequency Estimate) The kernel frequency estimate k_x^L of an instance x is determined using the set of labeled instances L. The y-th element of that C-dimensional vector describes the similarity-weighted number of labels of class y:²

(k_x^L)_y = ∑_{(x′, y′) ∈ L} 1_{y′ = y} · K(x, x′)

We denote f^L as a classifier which uses the labeled data L for training.³ Similar to the Parzen Window Classifier (PWC) used in Chapelle (2005), the classifier f^L predicts the most frequent class:

f^L(x) = argmax_{y ∈ Y} ( (k_x^L)_y )

Our method requires estimating kernel frequencies, which is straightforward for the PWC but also possible for other classifiers. For example, Beyer et al. (2015) estimate kernel frequencies (called label statistics) for Naive Bayes, k-Nearest Neighbour, and tree-based classifiers.
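The kernel frequency estimate and the PWC decision rule can be sketched as follows (the RBF kernel is one possible similarity choice; the function names are illustrative):

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # similarity K(x, x') of two instances; RBF as one possible choice
    return np.exp(-gamma * np.sum((a - b) ** 2))

def kernel_frequency(x, L, n_classes, kernel=rbf_kernel):
    """Kernel frequency estimate k_x^L (Def. 2): the y-th entry is the
    similarity-weighted count of labels of class y in L."""
    k = np.zeros(n_classes)
    for x_l, y_l in L:
        k[y_l] += kernel(x, x_l)
    return k

def pwc_predict(x, L, n_classes):
    """Parzen Window Classifier: predict the most frequent class."""
    return int(np.argmax(kernel_frequency(x, L, n_classes)))
```

Instances close to x contribute almost a full count to their class; distant instances contribute almost nothing.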
² 1_cond denotes the indicator function, which returns 1 if cond is true and 0 otherwise. ³ To simplify the notation, we do not mention the parameters θ.

Toward Optimal Probabilistic Active Learning using a Bayesian Prior
The idea of our approach is to estimate the expected performance gain that a new instance would provide if we would acquire its label from the oracle. Then, we select the most promising instance for actual labeling. Within the next subsections, we explain the necessary steps towards the final method.

Estimating the Risk
In this article, we use the misclassification error as our performance measure (this can easily be changed). To optimize this performance, we minimize the estimated risk using the zero-one loss similarly to Vapnik (1995).
Definition 3 (Risk, Zero-one Loss) The risk describes the expected value of the loss L with respect to the joint distribution p(x, y) given a classifier f^L:

R(f^L) = E_{p(x,y)}[ L(y, f^L(x)) ] = E_{p(x)}[ E_{p(y|x)}[ L(y, f^L(x)) ] ]

The zero-one loss returns 0 if the prediction of the classifier f^L(x) is equal to the true class y and 1 otherwise:

L(y, f^L(x)) = 1_{f^L(x) ≠ y}

As the generator p(x) is not observable, we use a Monte-Carlo integration over a set of instances E which is able to represent the generator. For simplicity, we use the complete set of available instances, i. e. the labeled and the unlabeled data (E = {x : (x, y) ∈ L} ∪ U). Following the notation of Japkowicz & Shah (2011), we calculate the empirical risk R_E as follows:

R_E(f^L) = 1/|E| · ∑_{x ∈ E} E_{p(y|x)}[ L(y, f^L(x)) ] = 1/|E| · ∑_{x ∈ E} ∑_{y ∈ Y} p(y|x) · L(y, f^L(x))
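A minimal sketch of this Monte-Carlo risk estimate, assuming the class probabilities p(y|x) are given as a callable (in the article they are later replaced by the Bayesian estimate):

```python
def zero_one_loss(y, y_pred):
    # L(y, f(x)) = 1 if the prediction differs from the true class, else 0
    return 0.0 if y == y_pred else 1.0

def empirical_risk(E, class_probs, predict):
    """Monte-Carlo estimate of the risk over the evaluation set E.
    `class_probs(x)` returns an estimate of p(y|x) as a list/array,
    `predict(x)` is the classifier's decision f^L(x)."""
    total = 0.0
    for x in E:
        p = class_probs(x)
        y_hat = predict(x)
        # inner expectation over y ~ p(y|x) of the zero-one loss
        total += sum(p[y] * zero_one_loss(y, y_hat) for y in range(len(p)))
    return total / len(E)
```

With the zero-one loss, the inner expectation simplifies to 1 − p(ŷ|x), i.e. the probability mass of all classes other than the predicted one.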

Introducing a Conjugate Prior
The conditional class probability p(y|x) from Eq. (7) depends on the ground truth t, which is unknown (see Fig. 1):

p(y|x) = Cat(y | t(x)) = p_y

As a consequence, the probability p(y|x) is exactly the y-th element of the unknown ground truth vector p. We can use the nearby labels from L (represented in k_x^L, Def. 2) to estimate the ground truth p, as the oracle provides the labels according to p (see Fig. 1). With an increasing number of labels, our estimate converges to the correct ground truth. For estimation, we use a Bayesian approach by determining the posterior predictive distribution, i. e. calculating the expected value over all possible ground truth values p (see Murphy (2006) for details on predictive distributions):

p(y|k_x^L) = ∫ p(y|p) · p(p|k_x^L) dp

To determine the posterior probability p(p|k_x^L) of the ground truth p at instance x, we use Bayes' theorem in Eq. (10). The likelihood p(k_x^L|p) is a multinomial distribution as each label y has been drawn from Cat(y|p) (see Fig. 1).⁴ We introduce a prior p(p) which we choose to be a Dirichlet distribution with parameter α ∈ ℝ^C as this is the conjugate prior of the multinomial distribution. We choose an indifferent prior and set each element to the same value (α_1 = . . . = α_C ∈ ℝ_{>0}) such that none of the classes is favoured. Using this prior can be seen as adding α_y pseudo-instances to every class y (Bishop, 2006, p. 77). This means that in case of high values of α, we need many labeled instances (i. e., high frequency estimates k_x^L) to get distinct posterior probabilities.
As we use the conjugate prior of the multinomial likelihood, there exists an analytic solution for the posterior which is a Dirichlet distribution (Murphy, 2006).
p(p|k_x^L) = ( Multinom(k_x^L|p) · Dir(p|α) ) / ( ∫ Multinom(k_x^L|p′) · Dir(p′|α) dp′ ) = Dir(p|k_x^L + α)

Now, we determine the conditional class probability p(y|k_x^L) from Eq. (9) by calculating the expected value of the Dirichlet distribution (Murphy, 2006):

p(y|k_x^L) = ∫ Dir(p|k_x^L + α) · p_y dp = (k_x^L + α)_y / ||k_x^L + α||_1

The last term describes the y-th element of the normalized vector k_x^L + α. For normalization, we use the sum of all elements, denoted by the 1-norm || · ||_1.
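The resulting posterior predictive has a one-line implementation; a sketch:

```python
import numpy as np

def class_posterior(k, alpha):
    """Posterior predictive p(y | k_x^L) under a Dirichlet(alpha) prior:
    the expected value of Dir(p | k + alpha), i.e. the normalized
    vector (k + alpha) / ||k + alpha||_1."""
    v = np.asarray(k, dtype=float) + np.asarray(alpha, dtype=float)
    return v / np.sum(v)

# With no labels nearby (k = 0) the indifferent prior gives uniform
# probabilities; with more labels the estimate follows the frequencies.
print(class_posterior([0, 0], [1e-3, 1e-3]))   # [0.5 0.5]
print(class_posterior([9, 1], [1e-3, 1e-3]))   # ~[0.9 0.1]
```

This illustrates the role of α as pseudo-counts: small α lets even a few labels dominate, large α pulls the estimate toward uniform.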

Risk Difference Using the Conjugate Prior
We insert Eq. (14) into the empirical risk (Eq. (7)). As we approximate p(y|x) with p(y|k_x^L), this is an approximation of the empirical risk based on the labeled data L. Hence, we add L as an argument of the estimated empirical risk:

R̂_L(f^L, E) = 1/|E| · ∑_{x ∈ E} ∑_{y ∈ Y} p(y|k_x^L) · L(y, f^L(x))

We now assume that we add a new labeled candidate (x_c, y_c) to the labeled set L and denote the new set L⁺ = L ∪ {(x_c, y_c)}. To determine how much this new instance-label-pair improves the performance of our classifier f, we estimate the gain in terms of the risk difference under the current observations k_x^{L⁺}:

Δ R̂_{L⁺}(f^{L⁺}, f^L, E) = R̂_{L⁺}(f^{L⁺}, E) − R̂_{L⁺}(f^L, E)
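A sketch of this risk difference, assuming the updated class-probability estimates p(y | k^{L⁺}) and the two classifiers' decisions on E have already been computed (the array names are illustrative):

```python
def risk_difference(p_plus, pred_old, pred_new):
    """Risk difference: evaluate the old and the new classifier under the
    updated class-probability estimates. p_plus[i] is the estimated class
    distribution at the i-th evaluation instance, pred_old/pred_new the
    decisions of f^L and f^{L+} there."""
    n = len(p_plus)
    delta = 0.0
    for p, y_old, y_new in zip(p_plus, pred_old, pred_new):
        risk_old = 1.0 - p[y_old]   # expected zero-one loss of the old decision
        risk_new = 1.0 - p[y_new]   # ... and of the new decision
        delta += (risk_new - risk_old) / n
    return delta   # negative if the new label reduces the risk
```

Only instances where the decision changes (y_old ≠ y_new) contribute to the sum, which is also what makes the efficient implementation discussed in Sec. 6 possible.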

The Expected Probabilistic Gain
If we reduce the error under the new model L + , the risk difference in Eq. (17) becomes negative. Therefore, we negate this term as we aim to maximize the gain in Def. 4.

Definition 4 (Expected Probabilistic Gain)
The probabilistic gain describes the expected change in classification risk R when acquiring the label y c of candidate x c ∈ U.
As the label y_c and the corresponding ground truth t(x_c) are unknown, we estimate p(y_c|x_c) with p(y_c|k_{x_c}^L) according to Eq. (14), using Dir(β) as prior. We write

xgain(x_c, L, E) = E_{p(y_c|k_{x_c}^L, β)}[ −Δ R̂_{L⁺}(f^{L⁺}, f^L, E) ] = ∑_{y_c ∈ Y} ( (k_{x_c}^L + β)_{y_c} / ||k_{x_c}^L + β||_1 ) · ( R̂_{L⁺}(f^L, E) − R̂_{L⁺}(f^{L⁺}, E) )

For simplicity, we set β = α.
We define the selection strategy xPAL to choose the candidate that optimizes the xgain score.
Definition 5 (Selection Strategy: xPAL) The selection strategy xPAL (expected probabilistic gain for AL) chooses the candidate x_c* ∈ U with:

x_c* = argmax_{x_c ∈ U} ( xgain(x_c, L, E) )
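Putting the pieces together, xPAL can be sketched as follows. This is a simplified, self-contained re-implementation for illustration (naively quadratic in |L| and |E| per candidate), not the authors' released code:

```python
import numpy as np

def xpal_score(x_c, L, E, n_classes, kernel, alpha):
    """Sketch of the xgain score (Defs. 4/5): expected risk reduction when
    labeling candidate x_c, with a Dirichlet prior alpha (beta = alpha)."""
    def kfreq(x, data):
        k = np.zeros(n_classes)
        for x_l, y_l in data:
            k[y_l] += kernel(x, x_l)
        return k

    def posterior(k):                        # p(y | k, alpha), Eq. (14)
        v = k + alpha
        return v / v.sum()

    def predict(x, data):                    # PWC decision
        return int(np.argmax(kfreq(x, data)))

    p_c = posterior(kfreq(x_c, L))           # p(y_c | k_{x_c}^L, beta)
    gain = 0.0
    for y_c in range(n_classes):             # simulate each possible label
        L_plus = L + [(x_c, y_c)]
        delta = 0.0
        for x in E:                          # risk difference under k^{L+}
            p = posterior(kfreq(x, L_plus))
            delta += (1.0 - p[predict(x, L_plus)]) - (1.0 - p[predict(x, L)])
        gain += p_c[y_c] * (-delta / len(E))
    return gain

def xpal_select(U, L, E, n_classes, kernel, alpha):
    """xPAL: choose the candidate with the highest expected gain."""
    scores = [xpal_score(x, L, E, n_classes, kernel, alpha) for x in U]
    return U[int(np.argmax(scores))]
```

A usage example would pass the candidate pool as U, the labeled pairs as L, the union of both as E, and, e.g., an RBF kernel with a bandwidth chosen by the mean criterion.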

Theoretical and Qualitative Comparison
To provide an understanding of how the xPAL selection strategy works, we compare our new method to the most similar selection strategies by reformulating their approaches within our mathematical framework wherever possible. We provide the proofs for all theorems in the supplemental material. In Tab. 1, we summarize the primary differences and show the computational complexity.
In Fig. 2, we illustrate how the theoretical differences affect the actual choice of eight candidates on a toy dataset with two classes (blue diamonds and red rectangles). For classification, we use the same setup as in Sec. 6. The first eight labeled instances, chosen by the selection strategy, are marked with a gray circle. The background color shows how the respective selection strategy rates the usefulness of an area -darker areas are considered more useful than brighter areas.

Expected Probabilistic Gain for AL (xPAL)
As seen in Fig. 2, the currently labeled set L of xPAL is evenly spaced across the input space. That is, xPAL queried representative samples of the data set in the more explorative phase at the beginning, which leads to a rather good decision boundary with only eight labels. Focusing on the current usefulness scores indicated by green background color, we see that regions close to the decision boundary and regions with very few labels (green area at the bottom) are preferred. Moreover, we notice more usefulness at the right decision boundary compared to the left one as this area is seen as being more relevant (due to the higher density).

Expected Error Reduction (EER)
Theorem 1 The selection criterion of expected error reduction (EER) by Roy & McCallum (2001) can be written as follows. The extension of adding a beta-prior proposed by Chapelle (2005) is given in blue color.
Comparing Eq. (21) to Eq. (19), we see that there are only a few differences, highlighted in orange color. The main difference is the optimization objective: expected error reduction tries to query instances that minimize the expected error instead of the expected gain as in xPAL. Second, EER neglects the labeled instances L as it only uses U for Monte-Carlo integration, assuming that the unlabeled instances approximate the generator p(x) sufficiently well. In their original article, Roy & McCallum (2001) point out that the posterior estimates need to be reliable. Later, Chapelle (2005) addressed this limitation by introducing a beta-prior (highlighted in blue), which serves a similar goal as our prior α.

Table 1. Summary of differences between xPAL and the four most similar methods, evaluated on four criteria: (1) Is the usefulness estimated on a representative subset? (2) Does the method consider the performance gain? (3) Is some sort of prior included to handle uncertainties? (4) What is the asymptotic time complexity for determining the usefulness of one candidate sample?
Although the theoretical differences of the two strategies are small, we see a clear difference in the acquired instances and in the usefulness estimation in Fig. 2. Interestingly, the region close to the decision boundary is considered the least useful. Accordingly, EER neglects information there.
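Sketched in code, the EER criterion simulates every label and weights the resulting error by the classifier's own posterior; both callables below are placeholders for a concrete classifier and error estimate:

```python
def eer_score(x_c, U, class_prob, expected_error_after):
    """Sketch of expected error reduction (Roy & McCallum, 2001): simulate
    each label y_c for candidate x_c and weight the resulting expected
    error on the unlabeled pool U by the classifier's posterior p(y_c|x_c).
    `class_prob(x)` returns the classifier's class posterior at x;
    `expected_error_after(x, y, U)` retrains with (x, y) added and returns
    the expected error on U (placeholder callables)."""
    p = class_prob(x_c)
    score = 0.0
    for y_c, p_y in enumerate(p):
        score += p_y * expected_error_after(x_c, y_c, U)
    return score   # EER queries the candidate that MINIMIZES this score
```

The contrast to xPAL: the posterior p(y_c|x_c) comes directly from the classifier (no prior, unless Chapelle's beta-prior is added), and the error is evaluated on U only rather than on a risk difference over E.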

Probabilistic Active Learning (PAL)
Theorem 2 The selection criterion of (multi-class) probabilistic active learning (PAL) by Kottke et al. (2016) can be written as follows.
The probabilistic active learning approach by Kottke et al. (2016) does not consider a set E for risk estimation but estimates the risk locally, only for the candidate x_c. Hence, we set E = {x_c}. Instead, they include an estimated density weight p̂(x_c) for their local gain. As a prior distribution, they use the indifferent prior 1. The original method is non-myopic; as xPAL is myopic, we ignore this for the theoretical discussion.

In general, we see a similar acquisition behavior of PAL and xPAL (see Fig. 2). We see areas of high usefulness near the decision boundary and in sparsely labeled regions. xPAL appears to be more sensitive to the actual position of the instances as it considers the set E, whereas PAL only approximates this by using the density p̂(x_c). Hence, the influence of a new label on the complete classification task is only approximated in PAL.

Figure 2. Visualization of acquisition behavior for different selection strategies. The green color indicates how useful a selection strategy considers a region. The usefulness depends on the selection criterion of the strategy. The eight labeled instances have been selected by the corresponding selection strategy. Thereby, one can see where the selection strategy selected instances in the past and how the usefulness is spatially distributed to select the next instance for labeling.

Uncertainty Sampling (US)
Theorem 3 The selection criterion of confidence-based uncertainty sampling (US) by Lewis & Gale (1994) can be written as follows.
Uncertainty sampling does not consider a set for risk estimation, but it solely estimates the error at the candidate x c based on the current observations without any prior. Hence, it completely relies on the class posterior estimates from the classifier. Therefore, it might overestimate its certainty.
We observe this problem in Fig. 2 as US only finds one decision boundary and sticks at exploiting this. As it is not aware that the class posteriors on the left are highly unreliable (no labeled data here), it will only consider this region if the labels of all other candidates have been acquired. We notice a lack of exploration.
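The least-confidence score has a one-line form; a sketch:

```python
import numpy as np

def least_confidence(p):
    """Confidence-based uncertainty sampling: the usefulness of a candidate
    is 1 minus the probability of the most likely class under the current
    classifier; no prior, no evaluation set."""
    p = np.asarray(p, dtype=float)
    return 1.0 - p.max()

# a confident prediction is less useful than an uncertain one
assert least_confidence([0.9, 0.1]) < least_confidence([0.5, 0.5])
```

Because the score depends only on the classifier's own posterior, regions without any labels can look perfectly "certain", which is exactly the exploration failure observed in Fig. 2.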

Active Learning with Cost Embedding (ALCE)
The approach proposed by Huang & Lin (2016) uses an embedding with some special distance measure in a hidden space with non-metric multidimensional scaling. As this follows an entirely different way of approaching the problem, it is not possible to transfer this algorithm to our framework. As shown in Fig. 2, this approach explores the data space quite uniformly and is rather exploratory than exploitative.

Query by committee (QBC)
Query by committee (Seung et al., 1992) uses an ensemble of classifiers that are trained on bootstrapped replicates of the labeled set L. With few labels, the strategy explores the dataset due to high randomness in the subsets (see Fig. 2). Later, it starts exploiting more.

Experimental Evaluation
To evaluate the quantitative performance of xPAL, we conduct experiments on real-world datasets.⁵ We provide information on the used datasets, algorithms, and the experimental setup. We compare xPAL to state-of-the-art methods and show how the prior parameter affects the results.

Datasets and Competitors
We selected 27 datasets from the OpenML library (Vanschoren et al., 2013) and two pre-processed text datasets from Hernández-González et al. (2018) with TF-IDF features. For the latter, we assigned the majority vote as the true class.
In the supplemental material, we list all used datasets with their openML-identifier and show specific characteristics such as the number of instances, features, and instances per class.
Next to xPAL, we use multi-class probabilistic AL (PAL) by Kottke et al. (2016), confidence-based uncertainty sampling (US) by Lewis & Gale (1994), active learning with cost embedding (ALCE) by Huang & Lin (2016), query by committee (QBC) by Seung et al. (1992), expected error reduction (EER) by Chapelle (2005), and a random selector. We set all parameters according to the default values in the respective papers. For QBC, the disagreement within the randomly drawn sets, measured by the Kullback-Leibler divergence, describes the usefulness of a candidate. We use 25 classifiers as the committee, each of which is trained on a bootstrapped version of L with only a selection of features according to Shi et al. (2008).
Additionally, we implemented a baseline that has additional access to all labels of the unlabeled set U. It successively (greedily) selects the candidate, which minimizes the true empirical risk on U and L, called GREEDY-ALL. It is equal to xPAL where the estimated class probability from Eq. 8 is set to one for the true class.

Experimental Setup
To evaluate our experiments, we randomly split each dataset into a training set consisting of 60% of the instances and a test set containing the remaining 40% and repeat that 100 times. As we start without any labeled instances, U contains the whole training set at the beginning, and L is empty. We acquire 200 labels for every dataset or stop when U is empty.
For classification, we use the Parzen window classifier for all selection strategies. We applied three different kernels depending on the type of data. For numerical data, we z-standardize all features and use a radial basis function (RBF) kernel with bandwidth γ:

K(x, x′) = exp(−γ · ||x − x′||²)

We set the bandwidth of the kernel (γ = 1/(2s²)) according to the mean criterion proposed by Chaudhuri et al. (2017) with σ_p = 1. For categorical data, we use the hamming-distance kernel proposed by Hutter et al. (2014), where the hyperparameter γ is again determined through the mean bandwidth criterion.

⁵ Code: https://github.com/dakot/probal
For the text datasets, which contain TF-IDF features, we apply the cosine similarity kernel K(x, x′) = xᵀx′ / (||x|| · ||x′||).
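The RBF and cosine similarity kernels can be sketched as follows (the hamming-distance kernel and the mean bandwidth criterion are omitted; the bandwidth `s` below would be chosen by that criterion):

```python
import numpy as np

def rbf_kernel(x, y, gamma):
    # RBF kernel for numerical data (features z-standardized beforehand);
    # gamma = 1 / (2 * s**2) with s from the mean bandwidth criterion
    return np.exp(-gamma * np.sum((x - y) ** 2))

def cosine_kernel(x, y):
    # cosine similarity kernel for TF-IDF feature vectors
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))
```

Both kernels return 1 for identical inputs, so a candidate always contributes a full count to its own kernel frequency estimate, which the PAL proof in the appendix relies on.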

Comparison Between xPAL and Competitors
We visualize our results using learning curves in Fig. 3 and rank statistics in Fig. 4, 5, and 6. More results are given in the appendix. The learning curves show the misclassification error (averaged over the 100 repetitions) on the test set after each label acquisition for every combination of an algorithm and a dataset. The learning curve that reaches a low error fast is considered best.
Almost all learning curves show that the supervised baseline (GREEDY-ALL) performs best in an early phase. This is not surprising as it knows all labels (even from the unlabeled set U) to optimize the error on the training set. As seen on steel-plates-fault, this baseline does not achieve the best performance in all cases because of the greedy selection (no look-ahead). In that example, an optimal baseline would need to plan beyond the upcoming candidate. The xPAL approach (green, bold line) with α = 10⁻³ also performs well. For comparison, we additionally plot xPAL with α = 1. The differences between both curves are rather small.
As it remains difficult to quantitatively assess the performance due to the large number of datasets, we provide the mean rank plots in Figs. 4, 5, and 6. For this purpose, we calculated the rank of the area under the learning curve for each of the 100 repetitions and average this rank for every combination of a selection strategy and a dataset. We use color to visualize the performance: blue color means a good rank, and red color indicates bad performance. The rank of the best algorithm is printed in bold. Moreover, we performed a Wilcoxon signed-rank test to assess if the pairwise differences between xPAL and its competitors are significant. Three stars (***) indicate significantly better results of xPAL with a p-value of .001, two stars (**) indicate a p-value of .01, and one star (*) of .05. Analogously, significantly better performance of a competitor is shown with †. We obtain the mean column (right) by averaging the ranks over all datasets. The pattern (a/b/c) in the second row of each cell summarizes a) the number of highly significant wins, b) the number of cases with neither, and c) the number of highly significant losses.
We separated the ranking plots w. r. t. the kernel function. Figure 4 shows results with the RBF kernel, Fig. 5 with the hamming-distance kernel, and Fig. 6 with the cosine similarity kernel. One can observe that xPAL has the lowest mean rank for all kernels and is consistently printed in bluish color across the datasets. No other algorithm performs as robustly. The strongest competitor is PAL, but on the categorical data, we observe a clear performance difference between PAL and xPAL. One reason might be the difficulty of obtaining a reliable density estimate for categorical data.
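The mean-rank statistic behind Figs. 4-6 can be sketched as follows (ties are broken arbitrarily for simplicity; the input shape is an assumption about how the results are tabulated):

```python
import numpy as np

def mean_ranks(auc):
    """Mean rank per strategy: `auc` has shape (n_repetitions, n_strategies)
    holding the area under the learning curve (lower is better). Ranks are
    computed per repetition (1 = best) and then averaged over repetitions,
    as in the ranking plots."""
    order = np.argsort(auc, axis=1)            # best (lowest AUC) first
    ranks = np.empty_like(order)
    rows = np.arange(auc.shape[0])[:, None]
    ranks[rows, order] = np.arange(1, auc.shape[1] + 1)   # invert the ordering
    return ranks.mean(axis=0)
```

Averaging ranks instead of raw errors makes the comparison robust against datasets with very different error scales.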

Robustness of Prior Parameter
In Fig. 8, we show the mean ranking over all numerical datasets for different choices of the prior α. Compared to the other strategies (left image), there is only a small difference across all choices. Comparing xPAL with α = 10⁻³ to the other priors (right image), we see that there are datasets where the selected xPAL is significantly outperformed, but in general, the effect is negligible. Also, all mean ranks lie between 3.27 and 3.63, which validates the robustness of our parameter. We propose α = 10⁻³ as the default.

Computation Time
In Tab. 1, we already showed the theoretical time complexity. In this section, we report the actual computation time, which of course also depends on the efficiency of the implementation. Therefore, we artificially generated datasets with 500, 1000, . . . , 2500 instances and 2, 4, 6 classes. With every selection strategy, we acquired 200 labels and report the mean computation time on a personal computer in Fig. 7. We clearly see the steep growth in computation time for EER, which is also visible for xPAL. As xPAL only needs to calculate the loss difference on instances where the decision actually changes, we can reduce the computation time considerably. Because of the inefficient optimization in PAL, xPAL is even comparably fast to PAL for datasets with fewer than 1000 instances.

Conclusion
In this article, we moved toward optimal probabilistic AL by proposing xPAL. It is a decision-theoretic approach that determines the expected performance gain for labeling a candidate using a conjugate prior. We used this model to show the similarities and differences to the most related approaches and compared them by showing how each method selects their instances in a synthetic example. Moreover, we provide an exhaustive experimental evaluation indicating the superiority of xPAL and the robustness of its prior parameter.
In future work, we aim to apply this idea to other costsensitive loss functions and for error-prone annotators as this is a current limitation of this article. Moreover, we research possibilities to use the concept of xPAL to define a stopping criterion and to apply it for other classifier types. The combination of xPAL with methods of deep learning is also promising. However, several challenges need to be addressed, such as unreliable estimates of the class probabilities and the estimation of the vector k L xc . The former might be solvable by using techniques that improve the returned probabilities (e. g., by using Bayesian neural networks). The latter could be addressed by transforming samples into a latent representation (e. g., by using variational autoencoders). The resulting features would allow for a kernel density estimation. To extend this idea to regression problems, it will be necessary to combine the normally distributed output with a conjugate prior distribution (e. g., Gaussian-Wishart). This would allow for an analytic solution of the posterior which enables reliable estimation of the risk. (4) using a Monte-Carlo approach over P. They describe to use the unlabeled pool for that. In our work, we call this the candidate set U. Their algorithm consists of 4 steps: In short, they calculate the average expected loss for every instance x c ∈ U. Therefor, they consider every possible label y c ∈ Y and add the pair (x c , y c ) to the training set D (here: L). They call the resulting set D * (here L + ). The resulting expected losses are averaged, weighted with the respective posterior probability p(y c |x c ).
The posterior probabilities for our kernel-based classifier are determined using Eq. 28. Chapelle (2005) proposed to include a beta-prior and thereby extended the approach by Roy & McCallum (2001).
The resulting equation can be simplified as follows:
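The simplified expression is missing from this copy. A plausible reconstruction (our assumption, not necessarily the paper's exact equation), where k_{x_c,y} denotes the kernel frequency estimate for class y at the candidate x_c and β is the symmetric Beta/Dirichlet prior parameter:

```latex
\hat{p}(y \mid x_c) = \frac{k_{x_c, y} + \beta}{\sum_{i=1}^{C} \left( k_{x_c, i} + \beta \right)}
```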

A.2. Proof of Theorem 2
Multi-class probabilistic active learning (PAL) by Kottke et al. (2016) describes the expected gain in accuracy. Instead of evaluating this gain on a representative subset, they solely consider the gain locally. To prove Theorem 2, we need to set the m parameter of PAL to m = 1, which means that we only consider one possible label acquisition in each iteration. Kottke et al. (2016) model the hypothetical labels using a labeling vector l ∈ N^C, which describes the number of potentially added labels for each class. As we only consider one label at a time (m = 1), these vectors are unit vectors with a 1 at the element of the considered class y_c and 0 otherwise. Hence, l ∈ {e_1, ..., e_C}.
For simplicity, we write k instead of k_{x_c}, as PAL solely considers the candidate x_c and no other instance. Moreover, we know that k^{L+} = k + e_{y_c}, as we increment the frequency estimate of the simulated class y_c by 1 (the similarity of x_c to itself is always 1). In addition to l, Kottke et al. (2016) model the classifier's decision using a vector d, which is 1 for the class of the future decision and 0 otherwise.
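The m = 1 bookkeeping can be sketched in a few lines (illustrative only; `k` is the kernel frequency estimate at the candidate and the decision is the argmax class, matching the notation above):

```python
import numpy as np

def simulate_label(k, y_c):
    """Hypothetical m = 1 update: k^{L+} = k + e_{y_c}, because the
    similarity of the candidate x_c to itself is 1."""
    k_plus = k.copy()
    k_plus[y_c] += 1.0
    return k_plus

k = np.array([1.2, 1.0, 0.0])      # kernel frequency estimate k at x_c
y_hat = int(np.argmax(k))          # old decision ŷ = f_L(x_c)

# Which hypothetical labels y_c would change the decision to ŷ+ ≠ ŷ?
changing = [y_c for y_c in range(len(k))
            if int(np.argmax(simulate_label(k, y_c))) != y_hat]
```

With the example values, only acquiring a label of class 1 flips the decision, illustrating the split into the sets of decision-changing and decision-preserving labels used in the proof.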
For simplicity, we do not write the iterators at sums and products if they iterate from i = 1 to C. Based on the old classifier f_L and the new classifier f_{L+}, we write ŷ = f_L(x_c) and ŷ+ = f_{L+}(x_c) for the old and the new prediction, respectively.
We now insert I, II, III back into Eq. 37.
We divide the sum into two parts: (A) the subset of all labels (Y_≠) that change the decision, and (B) the labels (Y_=) that do not change the decision. Please remember that a new label y_c could change the decision ŷ+, as it includes the new label. Both sets are defined as follows. Now, we consider both cases independently.
B) Labels that do not change the decision

Here, we can use the following implications to rewrite the cases from Eq. 45 into the sum: In the last step, we use that ŷ = ŷ+ implies L(ŷ, ŷ+) − L(ŷ, ŷ) = 0. Additionally, we use that y_c = ŷ applies and thus k_ŷ = k^{L+}_ŷ. Next, we combine both cases: Because L(y, ŷ+) − L(y, ŷ) = 0 for y ∉ {y_c, ŷ} and for y_c ∈ Y_=, we can change this equation to:

A.3. Proof of Theorem 3

According to Settles (2009), the usefulness score for "least confidence uncertainty sampling" is determined by the following equation and can easily be rewritten. We denote:
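For reference, the least-confidence score from Settles (2009), stated here in our notation:

```latex
\mathrm{usefulness}(x_c) = 1 - \max_{y \in \mathcal{Y}} \hat{p}(y \mid x_c)
```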

C. More Experimental Results
In this section, we provide more plots from our experimental evaluation. Please refer to the original paper for the detailed explanation of the experimental setup and the discussion of the results. Table 3 describes the averaged area under the learning curve, including standard deviations and significance testing with the Wilcoxon signed rank test. The notation is similar to the one from the paper.

Figure 13. The mean rank for xPAL with different parameters and datasets across 100 repetitions. The best parameter is printed in bold.
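The area under the learning curve can be computed, for instance, by trapezoidal integration of the error over the number of acquired labels. This is a minimal sketch under that assumption; the exact normalization used in the paper may differ.

```python
def area_under_learning_curve(errors):
    """Trapezoidal area under a learning curve sampled after
    0, 1, ..., n label acquisitions, normalized by the n steps."""
    n = len(errors) - 1
    area = sum((errors[i] + errors[i + 1]) / 2.0 for i in range(n))
    return area / n

curve = [0.5, 0.3, 0.2, 0.2, 0.1]   # hypothetical error after each acquisition
aulc = area_under_learning_curve(curve)
```

Lower values indicate a faster-learning selection strategy.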

C.4. Detailed Ranking Plots for Different Parameters
The Wilcoxon signed rank test shows pairwise significance between xPAL with α = 10^−3 and its competitors.

D. Execution Times and Computing Infrastructure

Table 4 provides an overview of the execution times for the different selection strategies and datasets. The execution times are averaged over 100 repeated runs, each with a maximum of 200 instance selections. A single entry indicates the average time in seconds to select a single instance for a given dataset and selection strategy. The execution times primarily depend on the number of instances, but also on aspects such as the number of features and classes, as calculations might become more complex. All experiments were run on a heterogeneous computer cluster, which might lead to irregular results, as the speed varies between the cluster nodes.

Table 4. Execution times in seconds for one single instance averaged over all repetitions and acquisitions.