Active learning algorithm through the lens of rejection arguments

Active learning is a paradigm of machine learning which aims at reducing the amount of labeled data needed to train a classifier. Its overall principle is to sequentially select the most informative data points, which amounts to determining the uncertainty of regions of the input space. The main challenge lies in building a procedure that is computationally efficient and that offers appealing theoretical properties; most of the current methods satisfy only one or the other. In this paper, we use the classification with rejection in a novel way to estimate the uncertain regions. We provide an active learning algorithm and prove its theoretical benefits under classical assumptions. In addition to the theoretical results, numerical experiments have been carried out on synthetic and non-synthetic datasets. These experiments provide empirical evidence that the use of rejection arguments in our active learning algorithm is beneficial and allows good performance in various statistical situations.


Introduction
The aim of machine learning is to design learning models that accurately map a set of inputs from a space X, called the instance space, to a set of outputs Y, called the label space. Nowadays, with the data deluge, obtaining a powerful learning model requires a lot of data from X to be labeled, which is time consuming in many modern applications such as speech recognition or text classification. This motivated the development of other paradigms beyond classical prediction tasks. In this paper, we focus on prediction in the binary classification setting, that is Y = {0, 1}. In this framework, one of the most studied techniques to deal with the labeling cost is the iterative supervised learning procedure called active learning (Cohn et al., 1994; Castro & Nowak, 2008; Balcan et al., 2009; Hanneke, 2011; Locatelli et al., 2017, 2018), which aims at reducing the data labeling effort by carefully selecting which data need to be labeled. The goal of active learning is to achieve a high rate of correct predictions while using as few labeled data as possible. One of the key principles of active learning is to identify at each step the region of the instance space where the label requests should be made, called the uncertain region in this paper and also known as the disagreement region in the active learning literature (Hanneke, 2007; Balcan et al., 2009; Dasgupta, 2011). Many techniques have been developed to this aim, both in the parametric (Cohn et al., 1994; Hanneke, 2007; Balcan et al., 2009; Beygelzimer et al., 2009; Hanneke et al., 2014) and in the nonparametric setting (Minsker, 2012; Locatelli et al., 2017, 2018).
In this paper, we are particularly interested in the nonparametric setting, where several computational difficulties have so far hampered the practical implementation of the proposed algorithms. For example, (Minsker, 2012) provides interesting theoretical results which partly motivated Locatelli et al. (2017, 2018) as well as the present work, but it fails to provide a computationally efficient way to estimate the uncertain region. To overcome these shortcomings, we present a new active learning algorithm using the paradigm called rejection. The latter typically allows the learning models to evaluate their confidence in each prediction and to possibly abstain from labeling an instance (i.e., "reject" this instance) when the confidence in the prediction of its label is too weak. This rejection will however be used in a novel way in this work to conveniently compute the uncertain region, as explained below. Rejection and active learning typically differ in how they use this uncertain region. In rejection, the interest in the uncertain region appears after the design of a learning model, which rejects a test point in order to avoid a misprediction. This is very useful in some applications such as medical diagnosis, where a misprediction can be dramatic. In active learning, however, the uncertain region is used during the training process to progressively improve the model's performance by requesting labels where the classification is difficult. In our algorithm, we use rejection at each step k of the training process to estimate the uncertain region A_k ⊂ X based on the information gathered up to this step. Then some points are sampled from the region A_k and their labels are requested. Based on these labeled examples, an estimator f̂_k is provided, which is then used to assess for each x ∈ A_k the confidence in the prediction. The points where the confidence is low are rejected and are considered to form the next uncertain region A_{k+1}, thereby progressively reducing the part of the instance space X on which a model remains to be constructed. We study the rate of convergence with respect to the excess risk of our nonparametric active learning algorithm based on histograms under classical smoothness assumptions. It turns out that combining active learning sampling together with rejection allows for optimal rates of convergence. Using numerical experiments on several datasets, we also show that our active learning process can be efficiently applied to any off-the-shelf machine learning algorithm. The paper is organized as follows: in Section 2 we provide the background notions of active learning and rejection separately, then review some recent works that proposed to combine these two notions, although in a way that differs from ours. Then we describe our algorithm in Section 3 along with the theoretical guarantees about its rate of convergence. Practical considerations to take into account when applying our algorithm are discussed in Section 4. Numerical experiments are presented in Section 5 and we conclude the paper along with some perspectives for future work in Section 6. The full proof of our theoretical result is relegated to the Appendix.

Background
In this section, we review the literature related to active learning in Section 2.1 and to the reject option framework in Section 2.2. Thereafter, in Section 2.3, we review the use of rejection in the context of active learning.

Active learning
Given an i.i.d. sample (X_1, Y_1), . . ., (X_n, Y_n) from an unknown probability distribution P defined on X × Y, the classification problem consists in designing a map g : X → Y from the instance space to the label space. However, building such a mapping might become a tricky task in situations where the labels of input instances are only available through time-consuming or expensive requests to a so-called oracle. In such applications, one might however have access to a huge amount of unlabeled data from the instance space. This motivated the use of the active learning paradigm (Cohn et al., 1994), which aims at reducing the data labeling effort by carefully selecting which data to label. Active learning algorithms were initially designed according to somewhat heuristic principles (Settles, 1994), without theoretical guarantees on the convergence or on the expected gain with respect to classical "passive" learning. The theory of active learning has then gradually developed (Cohn et al., 1994; Freund et al., 1997; Balcan et al., 2009; Hanneke, 2007; Dasgupta et al., 2007; Castro & Nowak, 2008; Minsker, 2012; Hanneke & Yang, 2015; Locatelli et al., 2017, 2018; Kpotufe et al., 2022). We are particularly interested in the nonparametric setting, where regularity and noise assumptions are made on the regression function. Two types of regularity assumptions are made on the regression function. The first one was introduced in the seminal work by (Castro & Nowak, 2008) and was also used in (Locatelli et al., 2018), where it is assumed that the decision boundary {x : η(x) = 1/2} (where η is the regression function) is the graph of a smooth function. The second one, which was used in (Minsker, 2012; Locatelli et al., 2017), assumes that the whole regression function is smooth. In this work, we use a regularity assumption similar to that of (Minsker, 2012). Besides, the noise margin assumption corresponds to the so-called Tsybakov noise condition, and it was observed that it corresponds to the situation in which active learning can outperform passive learning (Castro & Nowak, 2008). In this work, we design an efficient active learning algorithm, similar to that considered in (Minsker, 2012), but handling the uncertain region in an explicit and computationally tractable way using rejection.

Classification with reject option
In the present contribution, we borrow some techniques from learning with reject option. Indeed, as detailed in Section 3, a core component of our active strategy relies on the confidence we have in the labels of the input instances. In contrast to the classical statistical learning framework, where a label is provided for each observation x ∈ X, learning with reject option is based on the idea that an observation for which the confidence in the label is not high enough should not be labeled. From this perspective, given a prediction function g : X → Y, an instance x ∈ X can be either classified, in which case the corresponding label is g(x), or rejected, in which case no label is provided for x (depending on the reference, the output for x is then ∅ or any symbol such as ⊕ meaning reject). A classifier with reject option g is then a measurable mapping g : X → Y ∪ {⊕}. The reject option was first introduced in the classification setting in (Chow, 1957). More recently, and since the development of conformal prediction in (Vovk et al., 1999, 2005), the reject option has become more popular and has been brought up to date to meet the current challenges. The paper by (Herbei & Wegkamp, 2006) proposed the first statistical analysis of a classifier based on the reject option. After these pioneering works, more papers on the reject option appeared (e.g., (Naadeem et al., 2010; Grandvalet et al., 2009; Yuan & Wegkamp, 2010; Lei, 2014; Cortes et al., 2016; Denis & Hebiri, 2019) and references therein). They mainly differ in the way they take the reject option into account. In particular, we can distinguish three main approaches: i) use the reject option to ensure a predefined level of coverage; ii) use the reject option to ensure a pre-specified proportion of rejected data; iii) consider a loss that balances the coverage and the proportion of rejected data. It has been established that, while there is no best strategy, controlling the coverage requires more labeled data than controlling the rejection rate, which in turn requires more (unlabeled) data than the last strategy, which performs the trade-off. On the other hand, this last approach does not control either of the two parameters. The reject option has also been used in different contexts, such as regression (Vovk et al., 2005; Denis et al., 2020) or algorithmic fairness (Schreuder & Chzhen, 2021). These papers show how the reject option can be used to efficiently solve issues that are intrinsic to the problem.
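To make the notion of a classifier with reject option concrete, the following minimal sketch implements a plug-in rule that abstains when an estimated regression value is too close to 1/2. The function name, the use of `None` as the reject symbol, and the fixed threshold `gamma` are illustrative assumptions; this is one simple instance of the general idea, not the specific rule of any of the cited works.

```python
from typing import Optional

REJECT = None  # plays the role of the reject symbol (⊕ in the text)


def classify_with_reject(eta_hat: float, gamma: float) -> Optional[int]:
    """Plug-in rule with reject option (illustrative sketch).

    eta_hat: estimate of P(Y = 1 | X = x), expected in [0, 1].
    gamma:   hypothetical half-width of the abstention band around 1/2.
    Returns 0 or 1 when confident, REJECT otherwise.
    """
    if not 0.0 <= eta_hat <= 1.0:
        raise ValueError("eta_hat must lie in [0, 1]")
    if abs(eta_hat - 0.5) <= gamma:
        return REJECT            # confidence too low: abstain
    return int(eta_hat >= 0.5)   # usual plug-in decision


def rejection_rate(eta_hats, gamma: float) -> float:
    """Fraction of points on which the rule abstains."""
    decisions = [classify_with_reject(e, gamma) for e in eta_hats]
    return sum(d is REJECT for d in decisions) / len(decisions)
```

With a fixed `gamma`, neither the coverage nor the rejection rate is controlled in advance; controlling the rejection rate instead amounts to choosing the threshold as a quantile of the confidence scores.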

Active learning with reject option
Most active learning schemes mentioned in Section 2.1 attempt to find the most "informative" samples in a region close to the decision boundary, called the uncertain region or disagreement region. Some recent works have refined this idea by adding an option to abstain from labeling the points (i.e., reject) that are considered too close to the decision boundary. Although the intersection of rejection and active learning seems natural, their combination is fairly recent. Current studies can be grouped into two different settings: the first one focuses on using the reject option to improve the performance guarantees of some standard active learning algorithms (Puchkin & Zhivotovskiy, 2021; Zhu & Nowak, 2022), and the second one focuses on providing a classifier which takes the reject option into account (Shekhar et al., 2021; Shah & Manwani, 2020), similarly to the standard reject option setting (Herbei & Wegkamp, 2006; Denis & Hebiri, 2019). In the first setting, (Puchkin & Zhivotovskiy, 2021) considered the parametric framework, particularly under model misspecification. That is, given a class of classifiers F (which possibly does not contain the Bayes classifier), the aim is to find an estimator f which achieves the minimum excess classification error. By using the reject option, (Puchkin & Zhivotovskiy, 2021) proved that exponential savings in the number of label requests are possible under model misspecification and the Massart noise assumption (Massart & Nédélec, 2006). Their algorithm is related to the disagreement-based approach (Hanneke, 2007; Balcan et al., 2009) and outputs an improper classifier f, that is, possibly f ∉ F. The work of (Puchkin & Zhivotovskiy, 2021) was extended by (Zhu & Nowak, 2022), which provides a more efficient active learning algorithm that overcomes the difficulty of computing the uncertain region. In (Zhu & Nowak, 2022), the authors build a classifier based on the rejection rule with exponential savings in labels, for which they establish risk bounds in a general parametric setting. At each trial, the classifier does not label points for which the doubt is substantial. This decision of abstaining from classifying a point is taken by considering a set of "good" classifiers among a parametric class of functions. In particular, a point is rejected if all "good" classifiers consider it as a difficult point, that is, the corresponding score is within the interval [1/2 − γ, 1/2 + γ], where γ is a (small) positive real value. However, an analysis of this algorithm raises three issues. First, the score at point x should be evaluated for all "good" functions in the class. Second, the tuning of the parameter γ is not discussed and might be tricky. Finally, the empirical performance of the proposed algorithm is not considered in the paper.
In the second setting, (Shekhar et al., 2021) considered the nonparametric framework under some smoothness and margin noise assumptions. The authors designed an active learning algorithm which outputs a classifier that takes the reject option into account in a standard way, as in (Denis & Hebiri, 2019), by deciding not to label the instances which are located near the decision boundary. In particular, the final output is a classifier with reject option. In their framework, they derived rates of convergence for an excess risk dedicated to the reject option framework and showed that these rates are better than those obtained by the passive learning counterpart (Denis & Hebiri, 2019). However, it is not obvious in this setting how to obtain computationally tractable algorithms, among other reasons because the hypothesis class needs to be restricted. In contrast, in the present paper, we focus on the classical active learning problem and derive rates of convergence for this problem, along with a practical implementation of the algorithm.

Contributions
The recent works mentioned in Section 2.3 (Puchkin & Zhivotovskiy, 2021;Shekhar et al., 2021;Zhu & Nowak, 2022) provide interesting theoretical contributions showing the interest of combining active learning and reject option.However the practical implementation of the related algorithms is not straightforward, notably because it is computationally difficult to estimate the uncertain region.
In this work, we use a particular combination of rejection and active learning to propose an active learning algorithm which is easy to compute in practice. More precisely, our contributions are threefold:
• We transform the typical classification with reject option framework (from Sections 2.2 and 2.3) to estimate the so-called uncertain region in a novel way. Not only does this methodology provide a computationally efficient algorithm for active learning, but it can also be applied to any off-the-shelf machine learning algorithm. This is a twofold major improvement over (Minsker, 2012).
• Beyond the appealing numerical properties of our procedure, we show that it achieves optimal rates of convergence for the misclassification risk and the active sampling under classical assumptions in this setting.
• We illustrate the benefit of our method on synthetic and real datasets.

Active learning algorithm with rejection
In this section, after introducing some general notations and definitions, we present our algorithm in a somewhat informal way, and then provide the theoretical guarantees under some classical assumptions.

Notations and definitions
Throughout this paper, X denotes the instance space and Y = {0, 1} is the label space. Let P be the joint distribution of (X, Y). We denote by Π the marginal probability over the instance space and by η(x) = P(Y = 1|X = x) the regression function. The performance of a classification rule g : X → {0, 1} is measured through the misclassification risk R(g) = P(g(X) ≠ Y). With this notation, the Bayes optimal rule that minimises the risk R over all measurable classification rules (Lugosi, 2002) is given by g*(x) = 1{η(x) ≥ 1/2} and we have: R(g*) = E[min(η(X), 1 − η(X))]. For any classification rule g, the excess risk is given by R(g) − R(g*) = E[|2η(X) − 1| 1{g(X) ≠ g*(X)}]. In this work, we consider the following active sampling scheme. For each A ⊂ X and M ≥ 1, we can sample i.i.d. random variables (X_i, Y_i)_{1≤i≤M} such that
1. for all i = 1, . . ., M, X_i is distributed according to Π(·|A);
2. conditional on X_i, the random variable Y_i is distributed according to a Bernoulli distribution with parameter η(X_i).
As is commonly done in the active learning setting, we assume that the marginal distribution of X is known (Minsker, 2012; Locatelli et al., 2017). In the next paragraph, we describe our active algorithm for classification. We pay particular attention to the definition of the uncertain region and of the rejection rate, which are the tools that merge active sampling and rejection.
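The active sampling scheme described above can be sketched in a few lines. The sketch below assumes, for illustration only, that Π is the uniform distribution on [0, 1]^d, that the region A is given as a membership predicate, and that the oracle is simulated from a known η; the names `sample_labeled`, `region` and `eta` are hypothetical.

```python
import random
from typing import Callable, List, Tuple

Point = Tuple[float, ...]


def sample_labeled(
    region: Callable[[Point], bool],
    eta: Callable[[Point], float],
    m: int,
    d: int,
    rng: random.Random,
) -> List[Tuple[Point, int]]:
    """Draw m i.i.d. pairs (X_i, Y_i) with X_i ~ Pi(. | A) and Y_i ~ Bernoulli(eta(X_i)).

    Assumption: Pi is uniform on [0, 1]^d, so Pi(. | A) is simulated by
    rejection sampling. The region must have positive mass, otherwise the
    loop never terminates.
    """
    sample = []
    while len(sample) < m:
        x = tuple(rng.random() for _ in range(d))
        if region(x):                              # keep only points falling in A
            y = 1 if rng.random() < eta(x) else 0  # simulated oracle answer
            sample.append((x, y))
    return sample
```

For example, `sample_labeled(lambda x: x[0] > 0.5, lambda x: x[0], 50, 2, random.Random(0))` returns 50 labeled points whose first coordinate exceeds 0.5.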

Overall description of the algorithm
With a fixed number of label requests N (called the budget), our overall objective is to provide an active learning algorithm which outputs a classifier that performs better than its passive counterpart.
The framework that we consider (Algorithm 1) is inspired by that developed in (Minsker, 2012), in which we incorporate rejection to estimate the uncertain region.
In the following, let (ε_k)_{k≥0} be a sequence of positive numbers. Let (N_k)_{k≥0} be a sequence defined such that N_0 = √N and N_{k+1} = c_N N_k with c_N > 1 (e.g., c_N = 1.2 in Section 5). Furthermore, we consider A_0 = X = [0, 1]^d as the initial uncertain region, and thus ε_0 = 1. We construct a sequence of uncertain regions (A_k)_{k≥1} and, for k ≥ 1, an estimator η̂_k of η on A_k is provided. First, our algorithm performs an initialization phase:
• Initially, the learner requests the labels Y of N_0 points X_1, . . ., X_{N_0} sampled in A_0 according to Π_0 = Π.
• Based on the initial labeled data, an estimator η̂_0 of η on A_0 is computed and an initial classifier g_{η̂_0} = 1{η̂_0 ≥ 1/2} is provided.
Afterwards, our algorithm iterates over a finite number of steps until the label budget N has been reached.
Step k ≥ 1 is described below.
• Based on the previous uncertain region A_{k−1} and on the rejection rate ε_k, the learner computes a threshold λ_k (Equation (3.2)). The sequence (ε_k)_{k≥0} explicitly defines the rejection rates (Denis & Hebiri, 2019).
• This constant λ_k is used to construct the current uncertain region A_k, which is the set where the previous classifier g_{η̂_{k−1}}(·) = 1{η̂_{k−1}(·) ≥ 1/2} might fail and thus abstains from labeling.
• According to Π(·|A_k), the learner samples i.i.d. points in A_k and requests their labels, from which an estimator η̂_k of η on A_k is computed.
• The learner updates the classifier over the whole space X.
After the iteration process, the resulting active classifier with rejection is defined point-wise in Equation (3.3).
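The steps above can be illustrated with a deliberately simplified, one-dimensional sketch. It is not the authors' implementation: it assumes Π uniform on [0, 1], a hypothetical regression function η(x) = x, a histogram estimate on a fixed grid, a confidence score |η̂ − 1/2|, ⌊N_k ε_k⌋ label requests per step, and an uncertain region whose mass is at most ε_k. All of these choices are assumptions made for the example.

```python
import math
import random


def eta(x):
    """Hypothetical regression function on [0, 1]; decision boundary at 0.5."""
    return x


def active_learning_with_rejection(budget, n_bins=20, c_n=1.2, c_eps=0.95, seed=0):
    rng = random.Random(seed)
    region = list(range(n_bins))   # A_0 = whole space, stored as bin indices
    eta_hat = [0.5] * n_bins       # histogram estimate of eta, one value per bin
    n_k, eps_k, used = math.sqrt(budget), 1.0, 0
    while region and used < budget:
        m = min(int(n_k * eps_k), budget - used)   # label requests at this step
        if m <= 0:
            break
        sums = dict.fromkeys(region, 0)
        counts = dict.fromkeys(region, 0)
        for _ in range(m):
            b = rng.choice(region)                 # X ~ Pi(. | A_k), Pi uniform
            x = (b + rng.random()) / n_bins
            y = 1 if rng.random() < eta(x) else 0  # simulated oracle answer
            sums[b] += y
            counts[b] += 1
        used += m
        for b in region:                           # update the estimate on A_k only
            if counts[b]:
                eta_hat[b] = sums[b] / counts[b]
        n_k, eps_k = c_n * n_k, c_eps * eps_k
        keep = int(eps_k * n_bins)                 # number of bins allowed in A_{k+1}
        # Rejection: keep the bins with the lowest confidence |eta_hat - 1/2|.
        region = sorted(region, key=lambda b: abs(eta_hat[b] - 0.5))[:keep]
    return eta_hat, used


def classify(eta_hat, x):
    """Plug-in classifier 1{eta_hat(x) >= 1/2} on the fixed grid."""
    b = min(int(x * len(eta_hat)), len(eta_hat) - 1)
    return int(eta_hat[b] >= 0.5)
```

Bins that leave the uncertain region keep their last estimate, which mirrors the idea that the classifier is frozen outside A_k while labels are concentrated where the confidence is low.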

Theoretical guarantees
This section is devoted to the theoretical properties of the proposed procedure under common assumptions, which are presented in Section 3.3.1. Thereafter, we state our main result in Section 3.3.2, which shows that our algorithm achieves an optimal rate of convergence for the excess risk when the considered classifier is the histogram rule.
Assumption 3.1 (Smoothness assumption). The regression function η is s-Lipschitz-continuous for some s ≥ 0, that is, for all x, z ∈ [0, 1]^d: |η(x) − η(z)| ≤ s ‖x − z‖.
Assumption 3.2 (Strong density assumption). The marginal probability admits a density p_X and there exist constants µ_min, µ_max > 0 such that for all x ∈ [0, 1]^d with p_X(x) > 0, we have: µ_min ≤ p_X(x) ≤ µ_max.
Assumption 3.1 imposes the regularity of the regression function η, while Assumption 3.2 ensures in particular that the marginal distribution of X admits a density which is bounded from below. Furthermore, we also assume that f(X) admits a bounded density.
Assumption 3.3. The random variable f(X) admits a bounded density (bounded by C > 0).
Assumption 3.3 has two important consequences.The first one is that the cumulative distribution function F f of f (X) is Lipschitz.The second one is that the so-called Margin assumption (Tsybakov, 2004) is fulfilled with margin parameter α = 1.This Margin assumption is also considered in (Minsker, 2012) for the study of optimal rates of convergence in the active learning framework.
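Both consequences follow from a one-line computation. The derivation below is a sketch that writes p_f for the density of f(X) and assumes, since the definition of f is not restated in this section, that the score measures the distance of η to 1/2, i.e. f(x) = |η(x) − 1/2|.

```latex
\begin{align*}
|F_f(t) - F_f(t')| &= \Big| \int_{t'}^{t} p_f(s)\,\mathrm{d}s \Big| \le C\,|t - t'|, \\
\mathbb{P}\big(|\eta(X) - \tfrac{1}{2}| \le t\big) &= \mathbb{P}\big(f(X) \le t\big)
  = \int_{0}^{t} p_f(s)\,\mathrm{d}s \le C\,t^{\alpha}, \qquad \alpha = 1 .
\end{align*}
```

The first line is the Lipschitz property of the cumulative distribution function with constant C; the second line is the Margin assumption with margin parameter α = 1.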

Rates of convergence
In this section, we present our main theoretical result (Theorem 3.5) which highlights the performance of our algorithm.While our methodology can handle any machine learning algorithm for the estimation of the regression function η, we provide theoretical guarantee with the histogram rule (whose definition is recalled in Definition 3.4) for the estimation of the regression function at each step of the procedure described in Section 3.2, as in (Minsker, 2012).For completeness, we provide the full proof of our result in this particular case in the Appendix.
It is known that in the passive framework, the histogram rule achieves optimal rates of convergence (Devroye et al., 1996).
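For readers unfamiliar with the histogram rule, the following sketch shows a standard version of it: the cube [0, 1]^d is split into regular cells and η is estimated by the average label in each cell. It is given for illustration and may differ in details from Definition 3.4; the function names and the convention of returning 1/2 on empty cells are assumptions.

```python
from collections import defaultdict


def cell_of(x, n_cells):
    """Index of the cubic cell containing x, with n_cells cells per axis."""
    return tuple(min(int(c * n_cells), n_cells - 1) for c in x)


def fit_histogram(points, labels, n_cells):
    """Return a function estimating eta by the mean label of each cell."""
    sums, counts = defaultdict(float), defaultdict(int)
    for x, y in zip(points, labels):
        c = cell_of(x, n_cells)
        sums[c] += y
        counts[c] += 1

    def eta_hat(x):
        c = cell_of(x, n_cells)
        n = counts.get(c, 0)
        # Assumed convention: an empty cell carries no information, hence 1/2.
        return sums[c] / n if n else 0.5

    return eta_hat


def histogram_classifier(eta_hat):
    """Plug-in rule 1{eta_hat(x) >= 1/2}."""
    return lambda x: int(eta_hat(x) >= 0.5)
```

The number of cells per axis plays the role of the inverse bandwidth: more cells reduce the bias of the estimate but leave fewer labeled points per cell.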
Theorem 3.5. Let N be the label budget, and δ ∈ (0, 1/2). Let us assume that Assumptions 3.1, 3.2, and 3.3 are fulfilled. At each step k ≥ 0 of the algorithm presented in Section 3.2, we consider Then with probability at least 1 − δ, the resulting classifier defined in Equation (3.3) satisfies where O hides some constants and logarithmic factors.
The above result calls for several comments. First, our active classifier ĝ based on the histogram rule is optimal for the active sampling w.r.t. the misclassification risk up to some logarithmic factors (see (Minsker, 2012) for the minimax rates, considering a Lipschitz regression function and a margin parameter equal to 1). This rate is better than the classical minimax rate in passive learning under the strong density assumption, which is of order N^{−2/(2+d)}; see for instance Audibert & Tsybakov (2007). Second, the sequence of the rejection rates (ε_k)_{k≥0} should be chosen in an optimal manner guided by our theoretical findings. In particular, for each k, the value of ε_k is of the same order as an upper bound on the error of η̂_{k−1} w.r.t. the sup-norm, valid with high probability. This value of ε_k is also linked to the probability of the uncertain region in the procedure proposed by Minsker (2012). However, the major difference with the latter reference is that our rejection rate is explicit, so that our algorithm can be efficiently computed thanks to the use of rejection arguments to determine the uncertain regions. Finally, let us note that our work can easily be extended to Hölder regression functions with parameter β. Indeed, for β ≥ 1, we can consider an estimator similar to that introduced in Definition 3.4, with a higher-order histogram rule using a smoothing kernel (Giné & Nickl, 2021).
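The comparison between the two regimes can be made explicit by specializing the general minimax exponents to the present setting. The display below is a sketch that assumes the usual forms of these exponents for β-Hölder regression functions and margin parameter α, as reported in Audibert & Tsybakov (2007) for passive learning under the strong density assumption and in Minsker (2012) for active learning; the exact statement of the bound is the one given in the theorem.

```latex
\begin{align*}
\text{passive:}\quad & N^{-\frac{\beta(1+\alpha)}{2\beta + d}}
  \;\xrightarrow{\ \beta = 1,\ \alpha = 1\ }\; N^{-\frac{2}{2+d}}, \\
\text{active:}\quad & N^{-\frac{\beta(1+\alpha)}{2\beta + d - \alpha\beta}}
  \;\xrightarrow{\ \beta = 1,\ \alpha = 1\ }\; N^{-\frac{2}{1+d}} .
\end{align*}
```

For instance, with d = 2 the passive exponent is 1/2 while the active one is 2/3, so the gain of active sampling is largest in low dimension and vanishes as d grows.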
Remark 3.6. Theorem 3.5 is established assuming the knowledge of the marginal distribution of X. This is a classical assumption in active learning that helps for sampling. However, it is possible to extend our result to unknown distributions at the price of an additional unlabeled sample, and then of an additional factor of order 1/√(size of the unlabeled sample).
In view of the above remark, we discuss the practical implementation of our proposed algorithm in the following section.

Practical considerations
Some practical aspects of the procedure are discussed in Section 4.1 and a simple numerical illustration is provided in Section 4.2.The full numerical experiments are presented in Section 5.

Uncertain region
In this section, we discuss the effective computation of the uncertain regions. Let k ≥ 1 represent the current step of our algorithm. We denote by f̃_{k−1}(X, ζ) a randomized version of the score f̂_{k−1}(X), where ζ is an independent perturbation whose amplitude is governed by a parameter u. Considering the randomized score f̃_{k−1} instead of f̂_{k−1} ensures that, conditionally on D_M, the cumulative distribution function of f̃_{k−1}(X, ζ), denoted by F_{f̃_{k−1}}, is continuous. Therefore, λ_k, defined as a maximum over thresholds t in (3.2), is expressed simply as the ε_k-quantile of the c.d.f. F_{f̃_{k−1}}. To preserve the statistical properties of f̂_{k−1}, the parameter u is chosen sufficiently small (e.g., u → 0). Note that the computation of the c.d.f. F_{f̃_{k−1}} requires the knowledge of the marginal distribution of X. In practice, this distribution may be unknown. In a second step, λ_k is therefore estimated by λ̂_k based on an unlabeled dataset D^U_{M_k}, where, conditionally on the data, F_{f̃_{k−1}} is replaced by the empirical c.d.f. of the random variable f̃_{k−1}(X, ζ). Furthermore, the unlabeled set D^U_{M_k} is assumed to be independent of D_M, and since it remains unlabeled, it does not contribute to the budget. The uncertain region Â_k is then formally defined from f̃_{k−1} and λ̂_k.
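The empirical computation of the threshold can be sketched as follows. The sketch assumes that the randomized score is obtained by adding an independent uniform perturbation on [0, u] to the score (an assumption; the exact randomization is not restated here) and that the threshold is the largest order statistic leaving at most a fraction ε_k of the unlabeled points at or below it. The function name `estimate_threshold` is hypothetical.

```python
import random


def estimate_threshold(scores, eps, u=1e-5, rng=None):
    """Empirical eps-quantile of the randomized scores s_i + zeta_i.

    scores: confidence scores evaluated on the unlabeled sample.
    eps:    target rejection rate, in (0, 1].
    u:      amplitude of the assumed uniform perturbation, used to break ties.
    Returns the threshold; points whose randomized score is at or below it
    form the estimated uncertain region.
    """
    rng = rng or random.Random(0)
    perturbed = sorted(s + rng.uniform(0.0, u) for s in scores)
    k = int(eps * len(perturbed))   # number of points allowed in the region
    if k == 0:
        return float("-inf")        # sample too small: reject nothing
    return perturbed[k - 1]
```

Because the perturbation makes ties almost surely impossible, exactly `k` points of the unlabeled sample fall at or below the returned threshold, which is why the empirical rejection rate tracks ε_k closely even with a moderate sample size.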

Illustrative example
For illustrative purposes, a two-dimensional dataset of 10^6 data points was generated using the regression function η(x_1, x_2) = (1/2)(1 + sin(π x_2 / 2)). We chose the estimators η̂_k to be linear, to make the comparison with the best linear classifier (x_2 = 0) straightforward. The budget was set to N = 5000, and the sequences N_k and ε_k were chosen as N_k = 1.2 N_{k−1} and ε_k = 0.95 ε_{k−1}, starting with N_0 = √N and ε_0 = 1. The parameter M_k was set to 150. A discussion of this choice of parameters can be found in Section 5.1. Figure 1 represents the situation after step k = 2 of the algorithm. At steps k = 1 and k = 2, λ_k has been computed using (3.2), which makes it possible to classify the points in Â_{k−1} \ Â_k (represented in black for k = 1 and in brown for k = 2). For visualization purposes, the points remaining in Â_2 have been colored according to their labels (y = 1 in green and y = 0 in blue), even though these labels are unknown at this step of the algorithm. The yellow points are those in Â_2 whose label has already been requested from the oracle. At subsequent steps, points in A_k are selected according to the rejection rates shown in the center part of Figure 1, which shows the theoretical rejection rates (ε_k, defined in Algorithm 1) in blue and the experimental ones (ε̂_k, counted as the number of points effectively rejected) in red. The latter were computed by repeating the simulations 10 times, to present the average results along with the standard deviations in grey. As a whole, the rejection rate is well estimated with only M_k = 150 unlabeled samples. However, the standard deviations indicate that the rejection rate is harder to control towards the end of the algorithm, because fewer points are available to estimate ε_k. The resulting learning curves for the passive and active procedures are represented on the right of Figure 1. As expected with this simplistic illustrative dataset, using active learning does not provide a substantial advantage in the long run (test precision = 0.817 ± 0.005 for active; 0.816 ± 0.003 for passive), because the optimal classifier is relatively easy to find in passive learning, even with noisy data. However, the right panel of Figure 1 shows that for a given small budget (e.g., N < 500), active learning converges faster than passive learning. This will be further examined in Section 5.
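As a sanity check on the precision values reported above, the accuracy of the classifier x_2 ≥ 0 can be simulated directly from the regression function. The sketch assumes that the inputs are uniform on [−1, 1]², which is not stated in the text but is consistent with x_2 = 0 being the best linear boundary. Under this assumption the expected accuracy is 1/2 + 1/π ≈ 0.818, close to the test precisions reported for both procedures.

```python
import math
import random


def eta(x1, x2):
    """Regression function of the illustrative example."""
    return 0.5 * (1.0 + math.sin(math.pi * x2 / 2.0))


def simulate_accuracy(n, seed=0):
    """Monte Carlo accuracy of the rule 1{x2 >= 0} on n simulated points."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n):
        # Assumed input distribution: uniform on [-1, 1]^2.
        x1, x2 = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
        y = 1 if rng.random() < eta(x1, x2) else 0
        prediction = 1 if x2 >= 0 else 0
        correct += prediction == y
    return correct / n


# Closed form under the same assumption: E[max(eta, 1 - eta)] = 1/2 + 1/pi.
BAYES_ACCURACY = 0.5 + 1.0 / math.pi
```

The gap between this ceiling and 1 comes entirely from label noise, which explains why neither procedure can do noticeably better than about 0.82 on this dataset.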

Algorithm 1: Active learning with rejection

Numerical experiments

Parameters choice and sampling strategy
This Section discusses some aspects of the practical implementation of our algorithm.

Parameters choice
To perform numerical experiments, a few parameters of our model have to be set. First, the sequence of rejection rates was defined such that ε_{k+1} = c_ε ε_k, with ε_0 = 1 and c_ε ∈ (0, 1). If c_ε is small, the uncertain region Â_k will be small, which corresponds to an "aggressive" strategy where many points are considered to be correctly classified at each step. Conversely, if c_ε is large, the strategy will be more "conservative". Second, the constant c_N defines the sequence N_k as N_k = c_N N_{k−1} and thus the number of points asked to the oracle at step k (N_k ε_k on line 17 of Algorithm 1). If c_N is large, the algorithm will use many points at each step, thereby consuming the budget faster. A larger budget therefore allows a larger c_N. Third, the number of points used to build the initial classifier is theoretically set to N_0 = √N. In practice, this number can be increased to get a better estimate η̂_0. Using a larger N_0 will however consume the budget faster. Fourth, M_k unlabeled data points in D^U_{M_k} are used at each step to estimate λ_k. If M_k is large, the estimation of λ_k will be more accurate. As these M_k points remain unlabeled, they do not contribute to the budget, and M_k could in principle be large. The only restriction is that, at each step k, these (unlabeled) points have to be sampled independently of the (labeled) points asked to the oracle, which indirectly limits the number of points available to the oracle. Several experiments (results not shown) indicate that M_k ≥ 100 provides a reasonable estimate of λ_k. Finally, the parameter u in Section 4.1 has been set to 10^{−5}. Its precise value does not much affect the results, as long as it remains close to 0.
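The effect of c_N and c_ε on the budget consumption can be examined with a small helper that lists the number of label requests per step. This is an illustrative sketch: it assumes that ⌊N_k ε_k⌋ labels are requested at step k (the rounding is an assumption) and that the last step is truncated to the remaining budget; the function name is hypothetical.

```python
import math


def label_schedule(budget, c_n=1.2, c_eps=0.95, n0=None):
    """Number of label requests at each step until the budget is exhausted.

    budget: total label budget N.
    c_n:    growth factor of N_k (N_{k+1} = c_n * N_k).
    c_eps:  decay factor of the rejection rate (eps_{k+1} = c_eps * eps_k).
    n0:     optional override of the initial size, sqrt(N) by default.
    """
    n_k = math.sqrt(budget) if n0 is None else float(n0)
    eps_k, used, schedule = 1.0, 0, []
    while used < budget:
        m = min(int(n_k * eps_k), budget - used)
        if m <= 0:
            break                   # requests have shrunk to zero: stop early
        schedule.append(m)
        used += m
        n_k, eps_k = c_n * n_k, c_eps * eps_k
    return schedule
```

With the product c_N c_ε above 1 the per-step requests grow geometrically and the budget is fully used in a few dozen steps, whereas an aggressive choice with c_N c_ε below 1 makes the requests shrink and can leave most of the budget unused.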
Unless otherwise stated, our numerical experiments were performed using a "conservative" approach for the parameters discussed above.
Sampling strategy We designed a sampling strategy that re-uses points whenever possible, using two recycling procedures explained below. This is not so important in our numerical experiments with synthetic data (Section 5.2), where 10^5 data points are used to mimic the theoretical situation with an "infinite" pool of data. However, it can become crucial in practical applications with limited labeled data, as in the non-synthetic datasets used in Section 5.3. The first recycling procedure is that the unlabeled points from step k − 1 are re-used at step k. This does not invalidate our theory, thanks to the additive form of the risk over the cells A_k. Indeed, our trained estimator has the form ĝ(·) = Σ_k ĝ_k(·) 1_{A_k}(·), and then its overall risk R(ĝ) can be decomposed on the different regions A_k (by conditioning on the data used to approximate the region from the previous iteration).
The second recycling procedure is that the data already labeled by the oracle at previous iterations (up to and including k − 1) are reused to train η̂_k, as long as they belong to the region A_k. A similar procedure was used in (Urner et al., 2013). This improves the estimation of η̂_k and limits the budget consumption. This sampling strategy is permitted because of the expression of the estimator and the decomposition of the risk noted above. It is particularly useful in practical applications where the total amount of labeled data is limited.

Synthetic datasets
Setting These numerical experiments were performed using 10^5 data points with a budget of N = 5000. The accuracy was tested on an independent test set of 5000 points that were never used at any step of the algorithm. The parameters are set according to Section 5.1.
The algorithm was first challenged on three synthetic two-dimensional binary datasets (named dataset 1, 2, and 3, respectively), to study cases in which it is favorable. Dataset 1 aims at reproducing in two dimensions a toy example used by (Dasgupta, 2011), where the best linear classifier is located at x_1 = −0.3 but active learning algorithms could be misled to x_1 = 0. Dataset 2 represents a situation where some data (x_1 < 0) are easy to classify while others (x_1 > 0) are not. Dataset 3 is a mixture of Gaussian distributions, whose parameters can be adjusted to create various degrees of overlap. The results presented here correspond to σ = 0.3. The datasets are presented in Figure 2, together with the corresponding learning curves for our active learning algorithm and its passive counterpart for several classifiers: linear SVM, SVM with a Gaussian kernel, random forests and k nearest neighbors. These classifiers are from the scikit-learn library (Pedregosa et al., 2011). Several parameters were tested, with similar results. The results in Table 1 were obtained with the following parameters: regularization constant C = 5 for SVM, 100 trees for random forests, k = 5 for kNN. The other parameters are kept at their default values.
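For reference, the four classifiers with the parameters listed above can be instantiated in scikit-learn as sketched below. The choice of `SVC` (rather than, say, `LinearSVC`) for the SVMs and the `random_state` argument are assumptions made for this example; the text only specifies C = 5, 100 trees and k = 5, with all other parameters at their defaults.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC


def make_classifiers(seed=0):
    """Return the four base classifiers keyed by a descriptive name."""
    return {
        # SVC is an assumed choice of SVM implementation.
        "linear SVM": SVC(kernel="linear", C=5),
        "Gaussian-kernel SVM": SVC(kernel="rbf", C=5),
        # random_state is added only to make the sketch reproducible.
        "random forest": RandomForestClassifier(n_estimators=100, random_state=seed),
        "kNN": KNeighborsClassifier(n_neighbors=5),
    }
```

Any of these estimators can then be fitted with `fit(X, y)` on the labeled points of the current uncertain region, which is what makes the procedure applicable to off-the-shelf learners.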
Results for datasets 1 and 2 In the case of SVM linear classifiers, our active learning algorithm is clearly superior to its passive counterpart for datasets 1 and 2, either with the larger budget (N = 5000) or with the smaller budget (N = 200).The situation is similar for SVM with Gaussian kernel, although it is less pronounced for dataset 2 at large budget.In the case of random forests and kNN, the difference is barely noticeable at large budget, but our algorithm is clearly superior with the smaller budget.
Results for dataset 3 Dataset 3 was designed to represent an easier classification problem.In this case our active learning algorithm does not present any advantage, although it does not significantly deteriorates the results (only slightly for SVM with Gaussian kernel).

Non-synthetic datasets
Several experiments were performed with various datasets from the UCI machine learning repository. Three "large" datasets (more than 10000 data points) were used: skin (245057 points in R points in R 113 ) and EEG (14980 points in $\mathbb{R}^{14}$). For those "large" datasets a maximum budget of N = 3000 was used. Three "small" datasets (fewer than 1000 data points) were also considered: breast (683 points in $\mathbb{R}^{10}$), cleveland (297 points in $\mathbb{R}^{13}$), and credit (690 points in $\mathbb{R}^{14}$). For those "small" datasets a maximum budget of N = 500 was used.
The results for the largest dataset (skin) are presented as learning curves in Figure 3. All results are summarized in Table 2. These results indicate that for the skin and fraud datasets, the converged accuracy (at large budget) is superior for active learning in the case of linear SVM, but very similar for the other classifiers. This is partially due to the fact that the resulting active classifier is not linear anymore. However, when the budget is limited to smaller values (see the insets of Figure 3), the active learning procedure provides a clear advantage.
The picture remains unchanged when we consider the "small" datasets. Indeed, most of the time the active method improves on the passive one (see Table 3). However, this improvement is rather limited, except for the cleveland dataset, where the use of the active algorithm is particularly beneficial.

Summary of the results and discussion
The study on synthetic datasets shows that our active learning algorithm using rejection provides a clear advantage for the first two datasets, especially at low budget, but not for the third dataset. This indicates that our algorithm is most useful in situations where the classification problem is more difficult.
In non-synthetic datasets, the active learning procedure appears to be most effective on larger datasets. The explanation is as follows. For small datasets (e.g., a few hundred points), the number of points $N_0$ has to be chosen quite small. The estimate $\hat{\eta}_0$ is thus likely to be inaccurate, which in turn implies an inaccurate estimation of the uncertain region in the first steps and then leads to a poorly controlled algorithm. Interestingly, even in such small datasets, our algorithm is rarely detrimental to the final accuracy reached and can even be useful when the budget is extremely limited.

Conclusion and perspectives
Recently, several works have started to combine active learning and rejection arguments by abstaining from labeling some data within an active learning algorithm. This combination is very natural since active learning and rejection both focus on the data that are most difficult to classify. In this work, instead of completely abstaining from labeling some data, we use rejection principles in a novel way to estimate the uncertain region typically used in active learning algorithms. We therefore propose a computationally efficient active learning algorithm that combines active learning with rejection. We theoretically prove the merits of our algorithm and show through several numerical experiments that it can be efficiently applied to any off-the-shelf machine learning algorithm. The benefits are more pronounced when the label budget is limited, which is promising for practical applications. Nevertheless, in the last steps of our algorithm the uncertainty about the label of some points can become very substantial, in which case it becomes natural to completely abstain from labeling. This abstention will be included in future work, combined with our use of the reject option.

Because the constant $c_5$ in (A.4) depends on L, we provide below a result stating that the variable L defined in (A.2) does not drastically affect the bounds in (A.4).
Lemma A.2 (Bounds on the maximum number of steps L).
Let us consider the variable L defined in (A.2), we have:

Proof.
By definition of L, we have where C is the bound on the density f provided in Assumption 3.3. Using again Assumption 3.3 we can write We then deduce that for all $t \in (1/2, 1)$, conditional on the data Given iteration $k \in \{0, \ldots, L-1\}$, we set $\tilde{t}_k = \|\hat{\eta}_k - \eta\|_{\infty, A_k}$ and $t_k = \frac{1}{2} + \tilde{t}_k$. Thanks to (B.1), with $\hat{\eta} = \hat{\eta}_k$ and $t = t_k$, we deduce that (conditional on $A_k$) Then, in the event E, we have that the data that have been sampled until step k. The random variable $\hat{f}_{k-1}$ is the score function built at step k − 1. The construction of the uncertain region $A_k$ relies on $\lambda_k$, which is the solution of Equation (3.2). First of all, we randomize the score function $\hat{f}_{k-1}$ by introducing a variable ζ, distributed according to a uniform distribution on [0, u] and independent of $D_M$, and by defining the randomized score function $\tilde{f}_{k-1}$ as $\tilde{f}_{k-1}(X, \zeta) = \hat{f}_{k-1}(X) + \zeta$.
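The role of the randomization can be illustrated numerically. The sketch below is not the paper's procedure: it assumes that low scores mark uncertain points, that the threshold is chosen as an empirical quantile of the randomized scores, and that u is small enough to break ties without reordering distinct scores. The function name `randomized_region`, the toy scores and the target rate 0.25 are hypothetical.

```python
import numpy as np

def randomized_region(scores, eps, u=1e-6, seed=0):
    """Select a fraction eps of the points with the smallest randomized scores.

    Adding zeta ~ Uniform[0, u], drawn independently of the data, makes the
    randomized scores distinct almost surely, so a threshold reaching the
    target rate exists. Assumes 0 < eps <= 1.
    """
    rng = np.random.default_rng(seed)
    zeta = rng.uniform(0.0, u, size=scores.shape[0])
    randomized = scores + zeta
    k = int(np.ceil(eps * scores.shape[0]))
    lam = np.sort(randomized)[k - 1]  # k-th smallest randomized score
    return randomized <= lam, lam

# Toy scores with heavy ties: only two distinct values.
scores = np.array([0.1] * 50 + [0.4] * 50)

# Without randomization, a threshold at 0.1 selects half of the points,
# and no threshold selects exactly one quarter of them.
plain_rate = np.mean(scores <= 0.1)

# With randomization, the target rate 0.25 is reached.
in_region, lam = randomized_region(scores, eps=0.25)
randomized_rate = in_region.mean()
```

The tie-breaking matters because a score function built from a classifier such as kNN or a random forest takes only finitely many values, so its distribution has atoms.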

Figure 1: Left: Illustrative dataset after the step k = 2 of the algorithm. The points in black belong to Â0 \ Â1 and the brown ones to Â1 \ Â2. In Â2 are the yellow points, whose labels have been requested from the oracle; the remaining points, in green and blue, correspond to y = 1 and y = 0, respectively. Center: theoretical (ε k , blue) and experimental (ε k , red with error bars in grey) rejection rates. Right: active vs. passive learning curves.

Figure 2: Top: From left to right, synthetic datasets 1, 2, and 3 used in this study, with the points colored in blue or cyan depending on their class. Bottom: corresponding learning curves for active and passive linear classifiers.

Table 2: Results on "large" non-synthetic datasets with several classifiers for active and passive procedures, with a budget of N = 3000.

Table 3: Results on three "small" non-synthetic datasets with several classifiers and a budget not to exceed 500.