An Improved Boundary Uncertainty-Based Estimation for Classifier Evaluation

This paper proposes a new boundary uncertainty-based estimation method that has significantly higher accuracy, scalability, and applicability than our previously proposed boundary uncertainty estimation method. In our previous work, we introduced a new classifier evaluation metric that we termed “boundary uncertainty.” The name “boundary uncertainty” comes from evaluating the classifier based solely on measuring the equality between class posterior probabilities along the classifier boundary; satisfaction of such equality can be described as “uncertainty” along the classifier boundary. We also introduced a method to estimate this new evaluation metric. By focusing solely on the classifier boundary to evaluate its uncertainty, boundary uncertainty defines an easier estimation target that can be accurately estimated directly from a finite training set without using a validation set. Regardless of the dataset, boundary uncertainty is defined between 0 and 1, where a value of 1 indicates that the minimum classification error probability (Bayes error) is achieved. We call our previous boundary uncertainty estimation method “Proposal 1” in order to contrast it with the new method introduced in this paper, which we call “Proposal 2.” Using Proposal 1, we performed successful classifier evaluation on real-world data and supported it with theoretical analysis. However, Proposal 1 suffered from accuracy, scalability, and applicability limitations owing to the difficulty of finding the location of a classifier boundary in a multidimensional sample space. The novelty of Proposal 2 is that it locally reformalizes boundary uncertainty in a single dimension that focuses on the classifier boundary. This convenient reduction with a focus toward the classifier boundary provides the new method’s significant improvements.
In classifier evaluation experiments on Support Vector Machines (SVM) and MultiLayer Perceptron (MLP), we demonstrate that Proposal 2 offers a competitive classifier evaluation accuracy compared to a benchmark Cross Validation (CV) method as well as much higher scalability than both CV and Proposal 1.


Introduction
A fundamental problem in statistics, machine learning, and pattern classification is to obtain an accurate estimate for the generalization ability of a learning algorithm trained on a finite dataset. The generalization ability of a pattern classifier is traditionally measured in terms of classification error probability. The goal of pattern classification is to execute the optimal classification rule (aka a Bayes decision rule) that corresponds to the minimum classification error probability (aka a Bayes error). However, the estimate of the error probability based on a finite amount of training data is so seriously biased that it cannot directly indicate the error probability [1].
This work was supported in part by JSPS KAKENHI No. 18H03266. David Ha (davidhafr@gmail.com), Doshisha University, Kyotanabe-shi, Kyoto-Fu, Japan.
A major conventional approach to classification error probability is Structural Risk Minimization (SRM) [2]. SRM provides an analytic estimation of the classification error probability. However, the resulting estimation can be loose in practice. Furthermore, the difficulty of deriving the necessary SRM in a case-by-case manner hinders its application [1,3].
Hold Out (HO) evaluation bypasses the bias intrinsic to the training data by splitting the available data into training data and validation data. Then, the empirical error rate on the validation data is used as an estimate of the error probability. However, the error probability estimate obtained from a particular validation split may also be biased. To resolve this issue, Cross Validation (CV) [4] averages the error probability estimates obtained in turn from different validation sets partitioned from the data. The particular case where every sample is used in turn as a validation set is called Leave-One-Out (LOO). LOO is known to converge to the expected error probability [5].
Bootstrap [6] reduces the variance of the error probability estimation by averaging the estimations obtained from different training sets by sampling with replacement. One drawback of the resampling approaches (CV, Bootstrap) is their costly training repetition, which can be prohibitive in real-world tasks [7], along with the sacrifice of separate data for evaluation. In principle, training and evaluation on the training data to directly target the Bayes error would be preferable, although this has been difficult so far.
Moreover, error rate smoothing methods can improve error probability estimation [8] based only on the training set. For example, Minimum Classification Error training [9] optimally determines the degree of smoothing of the empirical error rate to target the Bayes error. However, the settings and effect of this automatic smoothing are ongoing research issues [10].
Likelihood methods such as information criteria [11] and Bayesian model selection [12,13] rely on class posterior probability estimation instead of estimating the error probability. However, classification focuses the quality of the estimation on the boundaries that delineate class distributions, not on the class distributions themselves. Therefore, likelihood approaches are not necessarily optimal for classifier evaluation [14,15].
The limitations described above come from the intrinsic difficulty of estimating the error probability. To circumvent these limitations, we proposed finding the optimally trained classifier through a new classifier evaluation metric that is uniquely easy to estimate in principle, which we termed "boundary uncertainty" or alternately "Bayes-boundaryness" [16][17][18]. We chose the name "boundary uncertainty" because this evaluation metric measures the generalization ability of a classifier based on how equal class posterior probabilities are along the classifier boundary. It is known that the optimal classifier boundary is defined by equality among the class posterior probabilities, and this situation can be described as "uncertainty" along the boundary. Boundary uncertainty is fundamentally easier to estimate for two reasons: focusing the estimation on the classifier boundary instead of integrating estimations over the entire multidimensional space is easier, and the classifier boundary is defined based on classifier parameters that are precisely known. Furthermore, the value of boundary uncertainty implies how close we are to the Bayes risk.
In addition to defining boundary uncertainty, our previous work also proposed a boundary uncertainty estimation method that here we refer to as "Proposal 1." Through experiments and theoretical analysis, we showed that Proposal 1 could perform classifier evaluation using neither sample distribution assumptions nor data resampling methods [16][17][18]. However, Proposal 1 relied on largely unclear settings and heavy treatment that seriously limited both its accuracy and its scalability. This motivates our work on a new boundary uncertainty estimation method that we call "Proposal 2" and comprehensively introduce in this paper.
The paper is organized as follows. In Section 2, we prepare the formalization of the classification problem. In Section 3, we summarize the previous method, i.e., Proposal 1. In Section 4, we introduce Proposal 2 as a new method. Then, in Section 5, we discuss the time costs of classifier evaluation for the two methods. In Section 6, we perform experiments using these two methods and provide extensive discussion. In Section 7, we finally summarize the paper. Table 1 presents the main notations used in this paper. The first column of the table indicates the section where each notation is first used.

Bayes Decision Rule
We assume a d-dimensional sample space X, where we intend to discriminate between J classes. Given j ∈ [[1, J]] and x ∈ X, we denote by P(C_j | x) the (true) class posterior probability of C_j given x. The goal of a classifier is to minimize the following classification risk:
$$ R = \int_{\mathcal{X}} R(C(x) \mid x)\, p(x)\, dx \tag{1} $$
We can see that Eq. 1 operates over the entire multidimensional sample space. C(x) represents the classification decision for sample x (the decision is one of the classes C_r, where r ∈ [[1, J]]), and p(x) denotes the density at x. R(C(x) | x) represents the risk when assigning class C(x) to x:
$$ R(C_r \mid x) = \sum_{j=1}^{J} \lambda(C_r \mid C_j)\, P(C_j \mid x) \tag{2} $$
where λ(C_r | C_j) denotes the cost of classifying a member of class C_j as a member of class C_r. We assume the following cost:
$$ \lambda(C_r \mid C_j) = \begin{cases} 0 & \text{if } r = j \\ 1 & \text{otherwise} \end{cases} \tag{3} $$
This leads to the risk R(C_r | x) = 1 − P(C_r | x). This risk is minimized by choosing C_r such that P(C_r | x) is the highest among all of the class posterior probabilities at x. Based on this consideration, the decision rule that achieves the minimum classification error probability (aka a Bayes error) is [19]
$$ C(x) = C_i, \quad \text{where } i = \arg\max_{j \in [[1, J]]} P(C_j \mid x) \tag{4} $$
This optimal classification decision is called the Bayes decision rule. Unfortunately, class posterior probabilities are unknown and difficult to accurately estimate in practice.
Table 1 (excerpt) Main notations used in this paper
Section | Notation | Description
… | … | Given Λ, time cost of one classification in the sample space that has d dimensions and J classes
Section 5.1 | c^(2)(Λ) | Given Λ, time costs of Proposal 2 to estimate U(Λ)
Section 5.1 | c^(2)_shared | Preliminary time cost of Proposal 2 to estimate U(Λ), which is shared across L candidate classifier statuses and applied only once, regardless of L
Section 5.1 | R_K | Number of values that we try for K
Section 5.2 | c^(1)(Λ) | Given Λ, time costs of Proposal 1 to estimate U(Λ)
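As a small numerical illustration of the statement that the zero-one cost leads to R(C_r | x) = 1 − P(C_r | x), the following sketch computes the conditional risk of every candidate decision at one sample. It assumes the standard expected-cost form of the conditional risk; the posterior values, the function names, and the 0-based class indexes are hypothetical and chosen only for illustration.

```python
import numpy as np

def conditional_risk(posteriors, cost):
    """Return R(C_r | x) for every candidate decision r.

    cost[r, j] is the cost of deciding class r when the true class is j.
    """
    return cost @ posteriors

def bayes_decision(posteriors):
    """Return the (0-based) index of the class with the highest posterior."""
    return int(np.argmax(posteriors))

# Hypothetical posteriors P(C_j | x) at a single sample x, with J = 3 classes.
posteriors = np.array([0.2, 0.5, 0.3])
# Zero-one cost: 0 for a correct decision, 1 for any misclassification.
zero_one_cost = 1.0 - np.eye(len(posteriors))

risks = conditional_risk(posteriors, zero_one_cost)
print(risks)                       # equals 1 - posteriors under the zero-one cost
print(bayes_decision(posteriors))  # index of the minimum-risk class
```

In practice the true posteriors are unknown, which is why the remainder of the paper works with estimates.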

Classifier Decision Rule
We denote by T = {(x_n, y_n)}_{n∈[[1,N]]} the training set that consists of N pairs of a training sample x_n ∈ X and of its class index y_n ∈ [[1, J]]. We emphasize the difference between y(x_n) = y_n, the given class index at x_n, and C(x_n; Λ), the class label predicted by the classifier at x_n. The decision of most classifiers is formalized as
$$ C(x; \Lambda) = C_i, \quad \text{where } i = \arg\max_{j \in [[1, J]]} g_j(x; \Lambda) \tag{5} $$
The parameter vector Λ encompasses trainable parameters that are optimized on the training data (e.g., network weights) as well as hyperparameters that are traditionally set using validation data (e.g., regularization parameters). We refer to a specific value of Λ as "classifier status." g_j(·; Λ) is called a discriminant function. Its value at a sample x estimates the degree to which x belongs to C_j. g_j(x; Λ) does not necessarily directly estimate P(C_j | x) [20]. The goal of training is to find Λ so that the values of g_j(·; Λ) result in a classifier decision rule (Eq. 5) that executes the Bayes decision rule (Eq. 4).

Classifier Evaluation and Classifier Selection
A classifier evaluation metric measures the generalization ability of a classifier. This evaluation metric is traditionally the error probability, and its estimate is usually obtained using validation data. A lower classification error probability value indicates a classifier decision that has higher generalization ability. As illustrated in Fig. 1, classifier selection is the process of evaluating different classifier statuses Λ TR = {Λ m } m∈[ [1,L]] (green boxes on the left feed into the blue evaluation box in the center) in terms of a classifier evaluation metric (values for L different classifier statuses obtained are represented by red boxes on the right) and then selecting the status that scores the best (black box on the right).

Bayes Boundary
We denote by B* the classification boundary that is uniquely defined by Eq. 4, and we term it "Bayes boundary." B* is the optimal classification boundary that a classifier should execute. B* is locally defined by the equality between the highest class posterior probabilities. At a given x ∈ X, there can be equality between the three or more highest class posterior probabilities. For simplicity, we assume that equality is achieved only between the two highest class posterior probabilities at x in practice. We denote by {i*(x), j*(x)} the indexes of the two highest (true) class posterior probabilities at x. By convention, we order them so that i*(x) < j*(x). With these notations,
$$ B^{*} = \{\, x \in \mathcal{X} \mid P(C_{i^{*}(x)} \mid x) = P(C_{j^{*}(x)} \mid x) \,\} \tag{6} $$
This definition conveys the true impossibility of deciding for a single class along B*. We interpret this situation as uncertainty along B*.

Classifier Boundary
We denote by B(Λ) the classifier boundary that corresponds to the decision rule given by Eq. 5. Similarly to Section 3.1.1, we assume that B(Λ) involves equality between only the two classes that yield the highest discriminant function values at x. We denote by {i(x; Λ), j(x; Λ)} the indexes of these two classes at x; by convention, we order them so that i(x; Λ) < j(x; Λ). With these notations,
$$ B(\Lambda) = \{\, x \in \mathcal{X} \mid g_{i(x;\Lambda)}(x; \Lambda) = g_{j(x;\Lambda)}(x; \Lambda) \,\} \tag{7} $$

Boundary Uncertainty
Boundary uncertainty relies on two principles. First, the Bayes boundary B * solely consists of uncertain samples, whose two highest class posterior probabilities are equal. Second, there is a one-to-one relationship between a classifier decision and its corresponding classification boundary. Based on these considerations, we previously proposed to evaluate the generalization ability of a classifier parameterized by Λ in terms of the optimality of its classifier boundary. We defined boundary uncertainty as a classifier evaluation metric that measures the degree of equality between B * and B(Λ).
In order to define boundary uncertainty, our previous work considered the "local uncertainty" of samples x that are on B(Λ) and denoted it by Û(x; Λ). In principle, the local uncertainty U(x; Λ) takes as input the class posterior probability P(C_{i(x;Λ)} | x). U(x; Λ) measures the degree of equality between the class posterior probabilities P(C_{i(x;Λ)} | x) and P(C_{j(x;Λ)} | x). We only expect from U(·; Λ) that it takes higher values as P(C_{i(x;Λ)} | x) and P(C_{j(x;Λ)} | x) get closer to equality, and lower values otherwise. U(x; Λ) implicitly contains the top indexes {i(x; Λ), j(x; Λ)} through its arguments x and Λ. Figure 2 illustrates two possible choices of local uncertainty function: a triangle function (orange curve) and the binary Shannon entropy (blue curve). The horizontal axis of the graph can be read as either P(C_{i(x;Λ)} | x) or P(C_{j(x;Λ)} | x). The minimum and maximum values of these local uncertainty functions are U_min = 0 and U_max = 1, respectively. This results in a convenient range of values [0, 1] that is the same regardless of the dataset, where "1" indicates optimality of the boundary at x.
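The two local uncertainty functions mentioned above can be sketched as follows. The binary Shannon entropy is standard; the exact shape of the triangle function is not given in the text, so the form 1 − 2|p − 0.5| used here is an assumption that merely satisfies the stated properties (range [0, 1], maximum at p = 0.5). Function names are hypothetical.

```python
import numpy as np

def triangle_uncertainty(p):
    """Assumed triangle function: 1 at p = 0.5, decreasing linearly to 0 at p = 0 and p = 1."""
    p = np.asarray(p, dtype=float)
    return 1.0 - 2.0 * np.abs(p - 0.5)

def entropy_uncertainty(p):
    """Binary Shannon entropy in bits, with the convention 0 * log2(0) = 0."""
    p = np.asarray(p, dtype=float)
    q = 1.0 - p
    # np.where evaluates both branches, so silence the log2(0) warnings.
    with np.errstate(divide="ignore", invalid="ignore"):
        term_p = np.where(p > 0, p * np.log2(p), 0.0)
        term_q = np.where(q > 0, q * np.log2(q), 0.0)
    return -(term_p + term_q)

# Both functions peak when the two posteriors are equal (p = 0.5).
for p in (0.0, 0.3, 0.5, 1.0):
    print(p, float(triangle_uncertainty(p)), float(entropy_uncertainty(p)))
```

Both functions are symmetric around p = 0.5, which is why the horizontal axis can be read with either of the two posteriors.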
Then, given a classifier status Λ, Proposal 1 empirically defined the following classifier evaluation metric that we estimated from the training set [18] and termed "boundary uncertainty":
$$ \hat{U}(\Lambda) = \frac{1}{|N_B(\Lambda)|} \sum_{x \in N_B(\Lambda)} \hat{U}(x; \Lambda) \tag{8} $$
N_B(Λ) refers to a "near-boundary set" that ideally would consist of samples exactly on B(Λ). N_B(Λ) acted as a practical approximation of B(Λ) consisting of training samples that were very close to B(Λ), all along B(Λ). | · | refers to the cardinality operator. The hat notation (ˆ·) in this paper refers to quantities estimated from a finite dataset, in order to contrast with their expected value (based on an infinite dataset). In Eq. 8, boundary uncertainty is defined as the finite expectation of the local uncertainty function over N_B(Λ). Owing to the characterization of the Bayes boundary B* and to the property of the local uncertainty function, boundary uncertainty takes higher values as the classifier boundary is closer to optimal, and lower values otherwise. Furthermore, the relationship between classification error probability and boundary uncertainty was the equivalence between achieving U_max and achieving the minimum error probability (Bayes error). We describe this matter in more detail in later sections.
Figure 2 Two possible local uncertainty functions: binary Shannon entropy (blue) and a triangle function (orange).
To compute (8), Proposal 1 consists of two steps described in Sections 3.3 and 3.4, respectively. Before reviewing these two steps, we review the k Nearest Neighbor (kNN) regression rule that Proposal 1 relied upon to estimate the class posterior probability P (C i(x;Λ) |x) input into the local uncertainty.

Nonparametric Posterior Probability Estimation using k NN Regression
To estimate local uncertainties, both Proposal 1 and Proposal 2 rely on kNN regression because it estimates class posterior probabilities without globally imposing a probability model on the class distributions. Given x ∈ X and j ∈ [[1, J]], in order to estimate P(C_j | x), kNN regression requires a small volume V(x) that contains x. We denote by k(x) the number of samples that are contained in V(x), and by k_j(x) the number of samples among them that carry the label C_j (the basic version of kNN regression considers a uniform k across X, but adaptive kNN regression methods adapt k to each sample x).
kNN regression gives P̂(C_j | x) = k_j(x)/k(x). A higher k(x) provides more samples for the estimation (lower variance), but these samples are farther away from x (higher bias). This requires a tradeoff. k(x) is usually set using validation data.
[…] training set T whose samples of given labels C_1 and C_2 are represented by triangles and squares, respectively. Proposal 1 evaluates a candidate classifier status Λ that it takes as Input (left-hand side of Fig. 3). We illustrate B(Λ) with a blue curve.
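The kNN regression rule described in this section can be sketched as follows for the basic case of a uniform k, with the target volume V(x) taken as the set of the k nearest training samples under the Euclidean distance. The function name, the 0-based class labels, and the toy data are assumptions made for illustration.

```python
import numpy as np

def knn_posterior(x, train_x, train_y, k, n_classes):
    """Estimate P(C_j | x) as k_j(x) / k(x) from the k nearest training samples."""
    dists = np.linalg.norm(train_x - x, axis=1)
    nearest = np.argsort(dists)[:k]          # indexes of the k nearest samples
    counts = np.bincount(train_y[nearest], minlength=n_classes)
    return counts / k

# Toy one-dimensional training set with two classes (labels 0 and 1).
train_x = np.array([[0.0], [1.0], [2.0], [10.0]])
train_y = np.array([0, 0, 1, 1])
print(knn_posterior(np.array([1.2]), train_x, train_y, k=3, n_classes=2))
```

A larger k averages over more samples but reaches farther from x, which is the bias and variance tradeoff mentioned above.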

Step 1: Generation of Anchors
Proposal 1 directly applied the definition of B(Λ) (Eq. 7) to find B(Λ). It searched in X for zeros of the function f(x; Λ) = g_{i(x;Λ)}(x; Λ) − g_{j(x;Λ)}(x; Λ). An on-boundary sample a ∈ B(Λ) is characterized by f(a; Λ) = 0, so the intermediate value theorem applied to f(·; Λ) guarantees the existence of such a sample on any segment [x, x′] that satisfies f(x; Λ) > 0 and f(x′; Λ) < 0. We termed such an on-boundary sample an "anchor" (anchors are represented by red dots in Panel 1A). Proposal 1 therefore searched for anchors along the segments that join randomly picked pairs of training samples {x, x′} that satisfy f(x; Λ) > 0 and f(x′; Λ) < 0. The search along [x, x′] can be done by bisection (dichotomy): at each iteration, we checked the sign of f(·; Λ) at the middle of the segment, and then kept only the half of the segment across which f(·; Λ) changes sign (and which therefore contains an anchor). In practice, we imposed a maximum number of dichotomy iterations that we denoted by i_max.
How many anchors to generate, and where, was not obvious a priori. In Eq. 6, the equality of class posterior probabilities should be checked all along B(Λ). To get as close to this ideal situation as possible, Proposal 1 generated a large number of anchors denoted by R_A, each of which we obtained from a random pair of training samples as described above. In the multiclass case, Proposal 1 had to check i(·; Λ), j(·; Λ) at each dichotomy iteration. Indeed, if we generate an anchor defined by equality between two discriminant functions whose class indexes are not locally the highest, then this anchor is actually generated on a non-existing classifier boundary (in the sense of Eq. 7).
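The dichotomy search for one anchor can be sketched as below. This is a simplified two-class illustration: it assumes a continuous f and omits the multiclass check of the top indexes at each iteration. The function name `find_anchor` and the linear toy function f are hypothetical and stand in for a real classifier.

```python
import numpy as np

def find_anchor(f, x_pos, x_neg, i_max=30):
    """Bisection along the segment [x_pos, x_neg], where f(x_pos) > 0 > f(x_neg)."""
    a = np.asarray(x_pos, dtype=float)
    b = np.asarray(x_neg, dtype=float)
    for _ in range(i_max):
        mid = 0.5 * (a + b)
        # Keep the half of the segment across which f changes sign.
        if f(mid) > 0:
            a = mid
        else:
            b = mid
    return 0.5 * (a + b)

# Toy two-class case: f(x) = g_1(x) - g_2(x) for a linear classifier.
f = lambda x: x[0] + x[1] - 1.0
anchor = find_anchor(f, x_pos=[2.0, 2.0], x_neg=[0.0, 0.0])
print(anchor, f(anchor))
```

Each anchor costs up to i_max evaluations of the classifier, which is the source of the cost issue discussed later for Step 1.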

Filtering Out Off-Boundary Samples
Given an anchor a ∈ B(Λ), the most direct way to apply kNN regression would be to use as target volume V(a) the volume defined by the k training samples that are nearest to a. However, this approach led to poor quality of the boundary uncertainty estimate, especially for higher-dimensional data, because a naive spherical target volume largely contains off-boundary samples that bias the boundary uncertainty estimation. Proposal 1 filtered out this bias by forming target volumes that only contain near-boundary samples. Proposal 1 defined the set of near-boundary samples, denoted by N_B(Λ), as the set formed by the nearest training sample of each anchor generated in Step 1 (Panel 1B).
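The selection of the near-boundary set can be sketched as follows, assuming the Euclidean distance and a brute-force nearest-neighbor search. The function name and the toy coordinates are hypothetical.

```python
import numpy as np

def near_boundary_set(anchors, train_x):
    """Indexes of the training samples that are the nearest neighbor of at least one anchor."""
    # Pairwise Euclidean distances, shape (n_anchors, n_train).
    dists = np.linalg.norm(anchors[:, None, :] - train_x[None, :, :], axis=2)
    # One nearest training sample per anchor; duplicates are merged.
    return np.unique(np.argmin(dists, axis=1))

train_x = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
anchors = np.array([[0.1, 0.0], [0.9, 0.1], [0.2, 0.1]])
print(near_boundary_set(anchors, train_x))
```

Because several anchors can share the same nearest training sample, the near-boundary set is usually smaller than the number of anchors.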

Partitioning Near-Boundary Set into Target Volumes
Proposal 1 used Tree Divisive Clustering (TDC), repeated an imposed number R_T of times, to break down N_B(Λ) into clusters that we used as target volumes for kNN regression. We used such a clustering method so that the obtained target volumes could adapt to the distribution. As described in the next paragraph, the iterative nature of TDC enabled us to progressively keep some control over the size of clusters.
Every step of the TDC consists of a 2-means clustering that ends after an imposed number i_K of iterations; TDC thus adaptively broke down N_B(Λ) into clusters (Panel 1C). We denote by ¶ Λ the set of resulting clusters, and by C ∈ ¶ Λ one of the clusters. Here, the dependency on Λ emphasizes that the partition obtained by clustering is defined for a given N_B(Λ). Proposal 1 applied the kNN regression by using each C as a target volume (Panel 1D), or in other words, the "k" of "kNN regression" is set to |C| for each cluster C.
Clusters should contain enough samples to perform kNN regression, while focusing on B(Λ) as locally as possible (local uncertainties ideally measure equality between class posteriors on points along B(Λ)). Proposal 1 therefore continued to divide N B (Λ) into smaller clusters so long as the clusters obtained at a given TDC step contained more than N max samples. Then, Proposal 1 only retained clusters that contained more than N min samples.
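The divisive procedure described above can be sketched as follows: clusters larger than N_max are split by 2-means, and only clusters with more than N_min samples are retained. This is a minimal illustration, not the original implementation; the seeding strategy (two random samples as initial centers), the handling of degenerate splits, and all function names are assumptions.

```python
import numpy as np

def two_means(points, n_iter, rng):
    """Lloyd's algorithm with 2 centers; returns a boolean mask for the first cluster."""
    centers = points[rng.choice(len(points), size=2, replace=False)]
    mask = np.zeros(len(points), dtype=bool)
    for _ in range(n_iter):
        d0 = np.linalg.norm(points - centers[0], axis=1)
        d1 = np.linalg.norm(points - centers[1], axis=1)
        mask = d0 <= d1
        if mask.all() or not mask.any():
            break  # degenerate split (e.g., duplicated points)
        centers = np.stack([points[mask].mean(axis=0), points[~mask].mean(axis=0)])
    return mask

def tree_divisive_clustering(points, n_max, n_min, n_iter=10, seed=0):
    """Split clusters with more than n_max samples; keep those with more than n_min."""
    rng = np.random.default_rng(seed)
    pending, kept = [np.arange(len(points))], []
    while pending:
        idx = pending.pop()
        if len(idx) <= n_max:
            if len(idx) > n_min:
                kept.append(idx)
            continue
        mask = two_means(points[idx], n_iter, rng)
        if mask.all() or not mask.any():
            kept.append(idx)  # cannot be split further
            continue
        pending.extend([idx[mask], idx[~mask]])
    return kept

# Two well-separated one-dimensional blobs of 20 samples each.
points = np.concatenate([np.linspace(0, 1, 20), np.linspace(100, 101, 20)])[:, None]
clusters = tree_divisive_clustering(points, n_max=25, n_min=3)
print([len(c) for c in clusters])
```

Different seeds can give different partitions on less separated data, which is the reason for repeating the clustering several times.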

Local Uncertainty Computation
Proposal 1 adopted binary Shannon entropy as the local uncertainty function (blue curve in Fig. 2). When using the kNN regression to estimate class posterior probabilities, we do not consider individual samples but rather target volumes (in our case, elements of ¶ Λ ). Therefore, in the following, we explicitly describe the local uncertainty of clusters rather than that of individual samples.
Given a near-boundary cluster C ∈ ¶ Λ, for each given class index j ∈ [[1, J]], we count the number of samples that have predicted class index j. To estimate {i(C; Λ), j(C; Λ)} defined in Section 3.1.2, we define {î(C; Λ), ĵ(C; Λ)} as the pair of predicted class indexes that have the most samples in C. Then, we estimate the local uncertainty of this cluster, denoted by Û(C; Λ), by applying the local uncertainty function to the kNN regression estimate obtained on C. Recall that our estimation goal is ideally B(Λ) itself, so we hope that C is centered on B(Λ). Unfortunately, the TDC may result in some clusters that lie "completely on one side of B(Λ)," which results in a kind of bias in terms of local uncertainty estimation.

Boundary Uncertainty Computation
Sampling elements (clusters) from the partition ¶ Λ means that we are sampling independent outcomes (clusters) from the vicinity N_B(Λ) of the classifier boundary. Each cluster C has probability |C| / |¶ Λ|. The finite expectation of the local uncertainty function over ¶ Λ is thus
$$ \hat{U}^{(1)}(\Lambda) = \sum_{C \in ¶_{\Lambda}} \frac{|C|}{|¶_{\Lambda}|}\, \hat{U}(C; \Lambda) $$
where Û(C; Λ) denotes the estimated local uncertainty of cluster C. Superscript "(1)" refers to Proposal 1, in order to distinguish it from the boundary uncertainty estimate obtained by Proposal 2. The partition ¶ Λ obtained by a TDC depends on the random clustering seeds used at each TDC step (i.e., a 2-means clustering).
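The cluster-wise computation can be sketched as below. The exact expressions of the original equations are not reproduced here: this sketch assumes that the posterior in each cluster is estimated from the given labels of the two most frequently predicted classes, that the local uncertainty is the binary Shannon entropy, and that cluster weights are normalized by the total number of samples in the retained clusters. All names and the toy data are hypothetical.

```python
import numpy as np

def binary_entropy(p):
    """Binary Shannon entropy in bits, with 0 * log2(0) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return float(-(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p)))

def cluster_uncertainty(given, predicted, n_classes):
    """Local uncertainty of one cluster from its given and predicted labels."""
    pred_counts = np.bincount(predicted, minlength=n_classes)
    # Pair of predicted class indexes that have the most samples in the cluster.
    i_hat, j_hat = np.argsort(pred_counts)[-2:]
    k_i = int(np.sum(given == i_hat))
    k_j = int(np.sum(given == j_hat))
    if k_i + k_j == 0:
        return 0.0
    return binary_entropy(k_i / (k_i + k_j))

def boundary_uncertainty(clusters, n_classes):
    """Size-weighted average of the cluster uncertainties (assumed normalization)."""
    sizes = np.array([len(given) for given, _ in clusters], dtype=float)
    values = np.array([cluster_uncertainty(g, p, n_classes) for g, p in clusters])
    return float(np.sum(sizes * values) / np.sum(sizes))

# Two toy clusters, each given as (given labels, predicted labels).
clusters = [
    (np.array([0, 0, 1, 1]), np.array([0, 0, 1, 1])),  # balanced: uncertainty 1
    (np.array([0, 0, 0, 0]), np.array([0, 0, 0, 1])),  # one-sided: uncertainty 0
]
print(boundary_uncertainty(clusters, n_classes=2))
```

The second toy cluster illustrates the bias mentioned earlier: a cluster that lies on one side of the boundary pulls the estimate down.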
Step 2 of Proposal 1 accordingly performed R_T > 1 different runs of TDC. Each run indexed by r ∈ [[1, R_T]] has its own different set of random clustering seeds and results in a partition that we denote by ¶^(r) Λ. The final estimate (Eq. 11) aggregates the estimates obtained from these R_T partitions. We illustrate Eq. 11 with Panel 1D, contained in the black box that indicates R_T repetitions of Step 2.

Issues in Step 1 of Proposal 1
On the one hand, anchors and then the selection of their nearest training samples answered the difficult question "what is closest to B(Λ)?" by accurately executing the definition of B(Λ). On the other hand, an approach that revolves around anchors to search for B(Λ) may not be practical, since it is impossible to generate infinitely many anchors as required by the definition of B(Λ). Moreover, generating anchors from random pairs of training samples (that may be far from each other) results in costly treatments as described in Section 3.3. For example, reusing the notations of Section 3.3, for every checked position […]. This cost explodes as it is multiplied by the number i_max of dichotomy iterations, then by the large number R_A of anchors, and finally by the number of classifier statuses to evaluate.

Issues in Step 2 of Proposal 1
On the one hand, previous experiments showed that restricting the target volumes used for kNN regression to N_B(Λ) effectively increases the accuracy of Û^(1)(Λ) [18]. On the other hand, a discrete selection of "what is close to B(Λ) or not" unavoidably leads to the following dilemma: either taking too few samples that are not representative of B(Λ) or taking too many samples that are farther from B(Λ).
Moreover, the centering of target volumes on B(Λ) is not explicitly guaranteed, which can result in a bias of the boundary uncertainty estimate. Furthermore, the formation of clusters obeys a certain clustering criterion, and we thus do not have direct control over the size of clusters (e.g., when applying the 2-means clustering, some clusters may persistently be broken into most of the cluster itself and a "leftover" that consists of just one or two samples). This may result in a partition that has some persistently large clusters that do not measure the boundary uncertainty locally, as well as remaining "leftover" clusters that are too small to estimate local uncertainty. Last but not least, the R_T repetitions of Step 2 increase both time and memory costs.

Overview of Proposal 2
To avoid the costly generation of anchors in the multidimensional space (Section 3.5), Proposal 2 smoothly and implicitly filters off-boundary samples in a single shot. This is made possible by using a single dimension that represents a kind of distance of a sample to the classifier boundary (Section 3.1.3).
In contrast to Proposal 1, which did not guarantee that near-boundary clusters are centered on B(Λ) (Section 3.6), Proposal 2 can explicitly focus its estimation on B(Λ) (Section 4.4.1). This is possible because the value "zero" of the aforementioned single dimension is equivalent to being on B(Λ).
In contrast to the difficulty of controlling the size of clusters in Proposal 1 (Section 3.6), Proposal 2 uses target volumes where the number of samples is directly specified. Then, the effectively used number of samples in each target volume is adjusted automatically in one shot by Proposal 2. The determination of the effective number of samples for each cluster and each candidate classifier status is both cheap and justified (Section 4.4.2): it is determined so that the local uncertainty estimation appropriately focuses on B(Λ) (Section 4.4.1).
The empirical formalization of boundary uncertainty used in Proposal 1 actually left some unclear items, such as the possible confusion of whether to use {i * (x), j * (x)} (top given class indexes) or {i(x; Λ), j (x; Λ)} (top predicted class indexes). Such confusion was especially possible in the several intricate branching treatments that appeared in Proposal 1, as well as in the treatment of multiclass data [18]. In contrast, this paper introduces a more complete formalization of boundary uncertainty (Section 4.3). This formalization clarifies these items and permits a more systematic approach to boundary uncertainty estimation.

Definition of Near-Boundary-Ness Measurement
Given any two class indexes k, l ∈ [[1, J]], a classifier status Λ, and x ∈ X, we define the near-boundary-ness measurement as follows:
$$ f_{kl}(x; \Lambda) = g_k(x; \Lambda) - g_l(x; \Lambda) \tag{12} $$
We denote by B_kl(Λ) the classifier boundary between C_k and C_l, on which f_kl(·; Λ) = 0. A smaller value of |f_kl(x; Λ)| means that x is closer to B_kl(Λ), hence we termed f_kl(x; Λ) a "near-boundary-ness measurement." The sign of f_kl(x; Λ) indicates which "side" of B_kl(Λ) the sample x is on, namely whether the classifier assigns x to C_k or to C_l. To simplify notations, we shorten f_{i(x;Λ)j(x;Λ)}(x; Λ) to f(x; Λ) in the particular case {k, l} = {i(x; Λ), j(x; Λ)}, as well as in the two-class case. f_kl(·; Λ) can be computed for any classifier whose formalization follows Section 2. For example, in the case of a neural network, f_kl(x; Λ) is obtained by monitoring the output layer.
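The near-boundary-ness measurement f(x; Λ) for the two top predicted classes can be computed directly from a matrix of discriminant values, as in the following sketch. It assumes that the classifier exposes its discriminant function values as an array of shape (number of samples, J); the function name and the toy scores are hypothetical, and class indexes are 0-based.

```python
import numpy as np

def near_boundaryness(scores):
    """Return f(x) = g_i(x) - g_j(x) for the two top classes i < j of each sample.

    scores[n, c] holds the discriminant value g_c(x_n; Lambda).
    """
    top2 = np.argsort(scores, axis=1)[:, -2:]  # indexes of the two highest scores
    top2.sort(axis=1)                          # order each pair so that i < j
    i, j = top2[:, 0], top2[:, 1]
    rows = np.arange(len(scores))
    return scores[rows, i] - scores[rows, j], i, j

scores = np.array([
    [2.0, 1.0, 0.0],   # top classes 0 and 1, decided as class 0
    [0.1, 0.5, 0.4],   # top classes 1 and 2, decided as class 1
    [0.3, 0.3, 0.0],   # exactly on the boundary between classes 0 and 1
])
f, i, j = near_boundaryness(scores)
print(f, i, j)
```

A value of f close to zero flags a sample near the classifier boundary, and the sign tells which of the two top classes is predicted.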

Estimation of Near-Boundary-Ness Measurement Values at Perturbed Samples Instead of at Training Samples
The goal of this section is to provide a reliable representation of X in the near-boundary-ness measurement space. This lengthy section describes a preliminary treatment that we use throughout Proposal 2, so we cover it here before describing Proposal 2 itself to provide the necessary clarity for following the later sections. We temporarily assume two-class data for simplicity of notation in this section.
Application of the function f(·; Λ) on the training set T can be understood as sampling a mapped training set in the near-boundary-ness measurement space. We denote by f(T; Λ) this projection. As covered in the next sections, Proposal 2 performs some density estimations on f(T; Λ) in the near-boundary-ness measurement space. The goal of such density estimations is to estimate the density of f(X; Λ), or in other words to generalize effectively to the density of the entire mapped sample space. However, f(·; Λ) reflects the classifier decision, and it is well known that the classifier decision values can be quite different on the training set and on some (unseen) testing set X_te. This phenomenon is one aspect of overfitting.
In other words, even if T and X_te follow similar distributions in the sample space, their mapped counterparts f(T; Λ) and f(X_te; Λ) may follow quite different distributions in the near-boundary-ness measurement space. This sampling issue in the near-boundary-ness measurement space may be seen as a case of covariate shift [21].
Naively using f(T; Λ) may not lead to an accurate estimation of the density of f(X; Λ). We propose slightly perturbing the training samples and then applying f(·; Λ) at such perturbed versions of the training samples instead of at the training samples themselves. We denote by T̃ this perturbed version of the training set, where the tilde notation (˜·) denotes the operation of "perturbation." Our hope is that f(T̃; Λ) suffers from the covariate shift issue to a lesser extent than f(T; Λ).
Algorithm 1 describes the generation of T̃. Given a training sample x, we denote by x^(m) its m-th nearest training sample. The perturbation is uniform across the features of each training sample. The amplitude of the perturbation is adaptively set small to stay as close as possible to the training set in the sample space. A detailed description of Proposal 2 itself is given in the next section to explain why we want to stay close to the training set in the sample space.
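Since Algorithm 1 is not reproduced here, the following is only a hedged sketch of one possible perturbation scheme that matches the description above: uniform noise on every feature, with an amplitude tied to the distance between each sample and its nearest training sample x^(1). The scaling rule, the `scale` parameter, and the function name are assumptions and not the authors' algorithm.

```python
import numpy as np

def perturb_training_set(train_x, scale=0.5, seed=0):
    """Add small uniform noise whose size adapts to the nearest-neighbor distance."""
    rng = np.random.default_rng(seed)
    n, d = train_x.shape
    # Distance from each sample to its nearest other training sample.
    dists = np.linalg.norm(train_x[:, None, :] - train_x[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    nn_dist = dists.min(axis=1)
    # Per-feature amplitude chosen so that the displacement norm stays
    # below scale * nn_dist (assumed rule).
    amplitude = scale * nn_dist / np.sqrt(d)
    noise = rng.uniform(-1.0, 1.0, size=(n, d)) * amplitude[:, None]
    return train_x + noise

train_x = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
perturbed = perturb_training_set(train_x)
print(perturbed)
```

With `scale` below 1, every perturbed sample stays closer to its original sample than that sample's nearest training neighbor, which keeps the perturbed set close to the training set in the sample space.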
T̃ may be simple to obtain in principle for a wide range of classification tasks. For example, in the case of speech data, our perturbation can be obtained by adding some noise to the speech signal, which seems to be a common practice when preparing speech data [22].

Formalization of Boundary Uncertainty
In order to more adequately derive an estimation procedure for boundary uncertainty, we start by defining the expected boundary uncertainty that we denote by U (Λ).

Notations
Recall that we denoted by {i*(x), j*(x)} in Eq. 6 the indexes of the two class posterior probabilities whose values are highest at x. We denote by I* the set of pairs of indexes that compose the Bayes boundary and, ∀{k, l} ∈ I*, we denote by B*(k, l) the set of training samples whose two top indexes are k and l (Eq. 14).

Expected Boundary Uncertainty
Similarly to the definition of the risk in Eq. 1, we propose the expected boundary uncertainty U(Λ) (Eq. 15), where R_U(C(x) | x) represents the penalization of non-boundary uncertainty of the classification decision C(x) at x. U(Λ) is the opposite of a loss: higher values of U(Λ) correspond to more optimal classifier decisions. The infinitesimal width δV is introduced to avoid a probability measure equal to zero. We will clarify this formalization item in a future work. p(x) denotes the density at x.
We now consider a classification decision C(x; Λ) made by a classifier that is parameterized by Λ, where C(x; Λ) is described in Eq. 5. In the boundary uncertainty framework, our quantity of interest in C(x; Λ) is the pair of the two top predicted indexes (Sections 3.1.1 and 3.1.2). In this section, we write C_{i(x;Λ)j(x;Λ)}(x; Λ) instead of C(x; Λ) in order to emphasize this quantity of interest. The penalization of non-boundary uncertainty of the classifier at x is given by Eq. 16, where we introduced the output variable C_kl. Once more, the subscript in C_kl involves two class indexes {k, l} owing to the focus of boundary uncertainty on the pair of class indexes that define the boundaries. We define C_kl through the discrete probability distribution given by Eq. 17.
To evaluate the classification decision at x, the "boundary-wise cost" λ_U(C_{i(x;Λ)j(x;Λ)}(x; Λ) | C_kl) penalizes at x the non-uncertainty of the two-class classifier boundary that is relative to the top indexes i(x; Λ), j(x; Λ), as given by Eq. 18. The integration of Eqs. 17 […] (should be in most cases if the classifier training did not fail), then we evaluate the classifier boundary at x using the sign-reversed local uncertainty function −U(x; Λ). This branching treatment in our formalization reflects the branching treatments that appeared in Proposal 1 [18].
The Bayes theorem gives P(C_kl | x) p(x) = p(x | C_kl) P(C_kl), where P(C_kl) corresponds to the overall probability in the sample space X that {k, l} are the true top class indexes. Substituting Eqs. 17 and 18 and using the Bayes theorem in Eq. 16 gives Eq. 19, where we note that B*(k, l) is defined in Eq. 14 and that the sum runs only over pairs {k, l} ∈ I*, because the definition in Eq. 17 implies that the terms of Eq. 19 are non-zero only for pairs {k, l} ∈ I*. ∀{k, l} ∈ I*, we define U_kl(Λ) (Eq. 20). Equation 19 shows that in a multiclass setting, U(Λ) appears as a weighted combination of two-class boundary uncertainties {U_kl(Λ)}_{{k,l}∈I*}.
The above expressions generalize both our previous empirical understanding of boundary uncertainty defined by Eq. 8 and Proposal 1.

Property of Boundary Uncertainty
As we described in Section 3.1.3, our previous work implicitly assumed equivalence between U_max and achieving the Bayes decision rule. We now consider this statement more carefully through the successive equivalences of Eqs. 21 to 24, whose last steps read: (in practice) ⇔ B(Λ) = B* (23) ⇔ the Bayes decision rule defined by Eq. 4 is achieved (24). Section 6.9 discusses the successive equivalences (21) to (24) in more detail. The next sections describe the estimation of U(Λ).

Empirical Boundary Uncertainty and its Implicit Centering on Classifier Boundary
In Proposal 2, we attempt to estimate the expected boundary uncertainty defined by Eq. 19 with the empirical sum of Eq. 25, where B̂*(k, l) is the approximation of B*(k, l) obtained from the training set, and λ̂_U(C_{i(x;Λ)j(x;Λ)}(x; Λ) | C_kl) is the empirical version of Eq. 18, which is discussed in detail in Section 4.4.4. We note that execution of the sum defined by Eq. 25 requires a preliminary estimation of I*, and we give details of its estimate, denoted by Î*, in Section 4.4.3.

Figure 4
Illustration of the locally one-dimensional count by Proposal 2 on a two-class and two-dimensional dataset. Samples of given class labels C 1 and C 2 are represented by yellow squares and green triangles, respectively. For a given Λ, we represent B(Λ) with a blue curve. The horizontal axis f (·; Λ) is described in Section 4.2.
A key point of Proposal 2 is to implicitly perform the filtering expressed by "∩ (B(Λ) + δV )" in Eq. 25 without having to explicitly generate anchors. This actually draws inspiration from a work that used a one-dimensional space based on the discriminant functions to estimate the classification error probability [23]. Here, we describe how to implicitly define clusters that are centered on B(Λ).
Given an integer M > 0, ∀x ∈ T, we denote by N_M(x) the set (cluster) formed by x and its M − 1 nearest neighbors in T. We consider a given x ∈ T. Given a kernel function (e.g., a Gaussian kernel) and a small h_x > 0 to be specified, we define k̂_m(x; Λ) by Eq. 27, where y(x′) is the given class index of training sample x′. An important factor is the tilde notation (perturbation operation) applied when computing the near-boundary-ness measurement in Eq. 27, according to the considerations mentioned in Section 4.2.2. k̂_m(x; Λ) counts how many samples with given class index m fall within a distance h_x of B(Λ) in the near-boundary-ness measurement space. (Footnote 3: The formalism of Eq. 27 forms the basis for deriving Kernel Density Estimation [24].) We define k̂(x; Λ) by Eq. 28. k̂(x; Λ) implicitly delineates an on-boundary cluster that is contained in [−h_x; h_x] in the near-boundary-ness measurement space and that only contains (a total of k̂(x; Λ)) samples with given class labels C_{i(x;Λ)} and C_{j(x;Λ)}. As we describe in Section 4.4.2, h_x is usually a small value. The probability mass contained in a small volume is conserved through a change of variable. Therefore, the "implicitly defined on-boundary cluster that is centered on f_{i(x;Λ)j(x;Λ)}(·; Λ) = 0 in the near-boundary-ness measurement space" corresponds to a "small region δV(x) that is centered on B(Λ) around x in the sample space." We denote by δV the union of these individual volumes δV(x); Eqs. 27 and 28 thus adaptively execute the filtering expressed by "∩ (B(Λ) + δV)" in Eq. 25, without having to explicitly generate anchors or select their nearest training samples. Figure 4 illustrates Eq. 28 with M = 5 on a two-class dataset. We illustrate the cluster N_M(x) with a red circle, and the projection of (the perturbed) N_M(x) onto f(·; Λ) with a zooming effect. The Parzen kernels are represented by black kernels centered on each projection of a perturbed sample in the cluster.
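The kernel-weighted count described above can be sketched as follows. This is only an illustration under stated assumptions, not our implementation: it assumes a Gaussian kernel, that the near-boundary-ness values of the perturbed cluster members have already been computed, and that the count takes the usual Parzen form (a sum of kernel values evaluated at f/h_x). The names `gaussian_kernel` and `kernel_counts` are hypothetical.

```python
import math

def gaussian_kernel(u):
    """Unnormalized Gaussian kernel; equals 1 at u = 0 and decays with |u|."""
    return math.exp(-0.5 * u * u)

def kernel_counts(f_values, labels, top_pair, h):
    """Soft count, per class in top_pair, of cluster members lying near f = 0.

    f_values: near-boundary-ness values of the (perturbed) cluster members.
    labels:   given class index of each cluster member.
    top_pair: the two top predicted class indexes at the cluster center.
    h:        bandwidth h_x (assumed to be given).
    """
    counts = {m: 0.0 for m in top_pair}
    for f, y in zip(f_values, labels):
        if y in counts:  # samples of other classes are ignored
            counts[y] += gaussian_kernel(f / h)
    return counts

# Toy cluster of M = 5 samples; values are made up for illustration.
f_values = [-0.05, 0.02, 0.40, -1.50, 0.01]
labels = [1, 2, 1, 2, 3]
counts = kernel_counts(f_values, labels, top_pair=(1, 2), h=0.1)
total = sum(counts.values())  # plays the role of the total on-boundary count
```

Samples far from f = 0 (here 0.40 and −1.50) contribute almost nothing, which is how the count implicitly centers the cluster on the classifier boundary.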
For convenience, we introduce, ∀{k, l} ∈ Î*, the quantity w_kl(Λ) (Eq. 29). Each on-boundary implicit cluster contains k̂(x; Λ) samples, and thus the probability of each cluster among all clusters for B̂*(k, l) is k̂(x; Λ)/w_kl(Λ). Therefore, we can rewrite Eq. 25 as the finite expectation of the local uncertainty function over the set of implicit on-boundary clusters described above (Eq. 30), where superscript "(2)" distinguishes this from the estimate obtained by Proposal 1 in Eq. 11. For convenience, we define the estimated two-class boundary uncertainty Û_kl(Λ), ∀{k, l} ∈ Î* (Eq. 31).

Determination of Parameters Defining On-Boundary Implicit Clusters
Given x ∈ T, the parameters h_x and M determine the on-boundary cluster that is implicitly defined by Eq. 27 (M determines the "radius" of N_M(x), and then h_x is set in the direction orthogonal to B(Λ)). The appropriate setting of h_x temporarily requires us to consider a density estimation instead of a count estimation, since a count goes to infinity as the size of the dataset goes to infinity. The density estimation that corresponds to Eq. 27 is obtained by dividing Eq. 27 by M h_x [24]. A possible way of setting h_x for optimal density estimation is to minimize the Mean Integrated Squared Error (MISE). This results in "Silverman's rule of thumb" [25] (Eq. 32).
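For reference, one widely used form of Silverman's rule of thumb for a one-dimensional Gaussian kernel is h = 0.9 · min(σ, IQR/1.34) · n^(−1/5). The sketch below implements that form. Which variant is used here, and the choice of n as the number of values in the cluster, are assumptions of this example; the function name is hypothetical.

```python
import statistics

def silverman_bandwidth(values):
    """Rule-of-thumb bandwidth for a one-dimensional Gaussian kernel."""
    n = len(values)
    sd = statistics.stdev(values)                  # sample standard deviation
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles of the values
    iqr_scale = (q3 - q1) / 1.34
    # Fall back to the standard deviation when the IQR is degenerate.
    spread = min(sd, iqr_scale) if iqr_scale > 0 else sd
    return 0.9 * spread * n ** (-1 / 5)

# Made-up near-boundary-ness values of one cluster.
f_cluster = [-0.31, -0.12, -0.05, 0.02, 0.08, 0.15, 0.44]
h_x = silverman_bandwidth(f_cluster)
```

The bandwidth scales linearly with the spread of the values and shrinks slowly (as n^(−1/5)) as the cluster grows.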

Estimation of I * and its Related Quantities
By "quantities that are related to I * ," we refer to ∀{k, l} ∈ I * , B * (k, l) and P (C kl ). I * assumes knowledge of true class posterior probabilities. However, rather than requiring an accurate knowledge of true class posterior probabilities, I * only requires knowledge of the indexes of the two highest class posterior probabilities at each sample. In other words, a rougher estimate of class posterior probabilities is sufficient for our purpose, so long as the two top indexes are correctly estimated.
Independently of the classifier model that we want to evaluate, we propose estimating class posterior probabilities with a generative classifier model that we detail in Eq. 35. We denote by Λ_gen the trained classifier status of this generative classifier model. Then, ∀x ∈ T, we estimate {i*(x), j*(x)} as {i(x; Λ_gen), j(x; Λ_gen)}. Application of Eq. 13 to these estimated pairs of indexes gives Î*. Then, ∀{k, l} ∈ Î*, we obtain the corresponding estimates of B*(k, l) and P(C_kl).
Regarding the choice of a generative classifier model, the above considerations imply that a simple model may be enough for our purpose. For this reason, we chose the following prototype-based classifier (PBC), which has low computation costs. ∀j ∈ [[1, J]], we denote by T_j the set of training samples with given class label C_j, namely T_j = {x ∈ T | y(x) = j}. Our PBC represents T_j by K_j prototypes, which we denote by {p_k^j}_{k∈[[1,K_j]]}. The PBC is defined by Eq. 35, where Λ_gen corresponds to the set of prototypes obtained by class-wise K-means clustering.
As one possibility, ∀j ∈ [[1, J]], we set the value of K_j based on the Akaike Information Criterion (AIC). AIC can be used to determine the number of components of a Gaussian mixture model, and the K-means algorithm is a particular instance of the Classification EM algorithm for a Gaussian mixture model with equal mixture weights and equal isotropic variances [26]. We used existing results to adapt AIC to each class-wise K-means [26]. We give these results in the Appendix. Algorithm 2 summarizes the estimation of I* and that of its related quantities.
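The estimation of the two top class indexes from a prototype-based classifier can be sketched as follows. This is a minimal illustration under an assumption: classes are ranked by the Euclidean distance from x to their nearest prototype, which may differ from the exact scoring of Eq. 35. In practice the prototypes would come from class-wise K-means; here they are hard-coded, and the function name is hypothetical.

```python
import math

def top_two_classes(x, prototypes):
    """Return the two class indexes whose nearest prototype is closest to x."""
    nearest = {
        j: min(math.dist(x, p) for p in protos)
        for j, protos in prototypes.items()
    }
    ranked = sorted(nearest, key=nearest.get)  # closest class first
    return ranked[0], ranked[1]

# Hypothetical prototypes for J = 3 classes in two dimensions.
prototypes = {
    1: [(0.0, 0.0), (1.0, 0.0)],
    2: [(3.0, 0.0)],
    3: [(0.0, 5.0)],
}
pair = top_two_classes((1.2, 0.1), prototypes)
```

Only the ranking of the classes matters here, not the values of the scores, which is why a rough generative model can be sufficient.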

Estimation of Boundary-Wise Cost
Given {k, l} ∈ Î* and x ∈ T, the estimated boundary-wise cost λ̂_U that appears in Eq. 25 is given by Eq. 36. Instead of the binary Shannon entropy used by Proposal 1, Proposal 2 uses a triangle-shaped function (orange curve in Fig. 2) as a local uncertainty function. This corresponds to the local uncertainty of Eq. 37, defined ∀m ∈ {i(x; Λ), j(x; Λ)}. The reason for this choice was to achieve a more neutral penalization of non-uncertainty than the binary Shannon entropy. The binary Shannon entropy weakly penalizes non-uncertainty in the wide range [0.3; 0.7] around 0.5, while outside this range it strongly penalizes non-uncertainty.
Just as in Proposal 1, Proposal 2 estimates class posterior probabilities using the kNN regression rule (Section 3.2). ∀m ∈ {i(x; Λ), j(x; Λ)}, the kNN regression applied in the near-boundary-ness measurement space gives the estimate P̂(C_m|x; Λ) of Eq. 38. When we input P̂(C_m|x; Λ) into the local uncertainty function defined in Eq. 37, the first and second cases of the branching in Eq. 38 result in the first and second cases of the branching in Eq. 36, respectively.
∀j ∈ [[1, J]], we denote by N_j the number of training samples whose given class label is C_j, and by P̂_j = N_j/N the estimate of the class prior probability. In order to address class imbalance [29], Proposal 1 replaced each estimate P̂(C_l|x; Λ) by P̂(C_l|x; Λ)/P̂_l [18]. Proposal 2 does not perform such an adjustment. Indeed, if we assume adequate sampling, then unequal class prior probabilities are implicitly handled by the kNN regression.
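The difference between the two local uncertainty functions can be checked numerically. The sketch below assumes a triangle function that equals 1 at a posterior estimate of 0.5 and decreases linearly to 0 at 0 and 1; this specific form is an assumption of the example, as are the function names. The binary entropy is taken in base 2 so that both functions range over [0, 1].

```python
import math

def triangle_uncertainty(p):
    """Triangle-shaped uncertainty: 1 at p = 0.5, linear down to 0 at p = 0 or 1."""
    return 1.0 - abs(2.0 * p - 1.0)

def binary_entropy(p):
    """Binary Shannon entropy in bits; 0 by convention at p = 0 or p = 1."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

for p in (0.5, 0.4, 0.3, 0.1):
    print(p, round(triangle_uncertainty(p), 3), round(binary_entropy(p), 3))
```

At p = 0.3 the entropy is still about 0.88, whereas the triangle function has already dropped to 0.6, which illustrates the flat region of the entropy around 0.5.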

Two-Class Boundary Uncertainty Estimate Û_kl(Λ) in Separability Case
This section assumes a pair {k, l} ∈ Î* and refines Eq. 31 to handle the case where w_kl(Λ) = 0 (w_kl(Λ) is defined in Eq. 29). This case occurs when class distributions in the sample space are well-separated.
Let us assume that class distributions are separated by a nearly empty region in the sample space. In this case, the Bayes boundary B * is located in the empty region. We consider the three possible cases of B(Λ) and illustrate them in Fig. 5.
Case 1: B(Λ) (green curve) is close to B* (red curve). In this case, we are likely to get w_kl(Λ) = 0. As a result, Eq. 31 cannot be computed. However, we would like Proposal 2 to assign the best boundary uncertainty value U_max to Û^(2)_kl. We notice that all samples in B̂*(k, l) are correctly classified by such a B(Λ).

Figure 5
Panels 1, 2, and 3. These three panels respectively correspond to Cases 1, 2, and 3, which we describe in Section 4.5.

Case 3: In this case, we also get w_kl(Λ) = 0, so Eq. 31 cannot be computed. However, we would like Proposal 2 to assign the worst value U_min. We note that samples of either class C_k or class C_l are all misclassified because, in this case, B(Λ) assigns all samples of B̂*(k, l) to the same class.
We must refine Eq. 31 so that it can output a reasonable value of Û_kl(Λ) even in Cases 1 and 3. Based on the observation of Cases 1 and 3, we can quite simply identify these two cases by checking the classification error rate. We check the error rate not on T but on its perturbed version T̃, since the error rate itself directly depends on the values of the near-boundary-ness measurement. As described in Section 4.2.2, we always use f(T̃; Λ) instead of f(T; Λ). We denote by L^kl_tr(Λ) the classification error rate on the perturbed version of B̂*(k, l), namely {x + r_x ||x − x^(1)||}_{x∈B̂*(k,l)}.
This results in the following extended definition of Û_kl(Λ) that we use instead of Eq. 31:
$$\hat{U}_{kl}(\Lambda) = \begin{cases} U_{\max}, & \text{if } w_{kl}(\Lambda) = 0 \text{ and } L^{kl}_{tr}(\Lambda) = 0, \\ U_{\min}, & \text{if } w_{kl}(\Lambda) = 0 \text{ and } L^{kl}_{tr}(\Lambda) > 0, \\ \text{the value given by Eq. 31}, & \text{otherwise.} \end{cases} \quad (39)$$
We then redefine the boundary uncertainty estimate as Û^(3)(Λ) (Eq. 40). Superscript "(3)" distinguishes this from the estimate defined by Eq. 30, but we still consider it a part of Proposal 2.
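The branching treatment for the separability case can be sketched as follows. The values U_MAX = 1 and U_MIN = 0 are assumptions of this example (boundary uncertainty is defined between 0 and 1), and `regular_estimate` stands for the value that Eq. 31 would give when w_kl(Λ) > 0; all names are hypothetical.

```python
U_MAX, U_MIN = 1.0, 0.0  # assumed best and worst boundary uncertainty values

def two_class_uncertainty(w_kl, error_rate, regular_estimate=None):
    """Extended two-class estimate that also covers the case w_kl = 0.

    w_kl:             total on-boundary count for the pair {k, l}.
    error_rate:       classification error rate on the perturbed samples.
    regular_estimate: value of the ordinary estimate, used when w_kl > 0.
    """
    if w_kl > 0:
        return regular_estimate
    # No sample lies near the classifier boundary: decide from the error rate.
    return U_MAX if error_rate == 0 else U_MIN

case_1 = two_class_uncertainty(w_kl=0.0, error_rate=0.0)   # boundary in the gap
case_3 = two_class_uncertainty(w_kl=0.0, error_rate=0.5)   # one class misassigned
regular = two_class_uncertainty(w_kl=12.3, error_rate=0.1, regular_estimate=0.7)
```

The error rate is only consulted when no sample falls near the boundary, so the ordinary estimate is left unchanged whenever it can be computed.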

Implementation of Proposal 2
Algorithm 3 summarizes the implementation of Proposal 2.

Time Costs of Classifier Evaluation
One of the goals of Proposal 2 is to achieve scalability, since the scalability of resampling-based classifier evaluation methods such as CV is limited. Therefore, this section describes the time costs of Proposal 2 to evaluate Λ_TR = {Λ_m}_{m∈[[1,L]]}. We denote the costs of one addition, multiplication, computation of the exponential function, comparison, and classification by c_add, c_mul, c_exp, c_comp, and c_CL(d, J, Λ), respectively. c_CL(d, J, Λ) increases with d, J, and the number of parameters in Λ. For comparison, we will also describe the time costs of Proposal 1.

Time Costs of Proposal 2
We denote by c^(2)_shared the costs of Proposal 2 that are shared across all classifier statuses, or in other words, that appear only once. Here, superscript "(2)" distinguishes them from the time costs of Proposal 1. We denote by c^(2)(Λ) the costs that are exclusive to the evaluation of one classifier status Λ. The cost of evaluating the L classifier statuses Λ_TR = {Λ_m}_{m∈[[1,L]]} is
$$c^{(2)}(\Lambda_1, \cdots, \Lambda_L) = c^{(2)}_{\text{shared}} + \sum_{m=1}^{L} c^{(2)}(\Lambda_m). \quad (41)$$

Estimation of c^(2)_shared
Regardless of the number of candidate statuses Λ to evaluate, Proposal 2 requires us to compute and store the set of clusters {N_M(x)}_{x∈T}, as well as the set of distances {||x − x^(1)||}_{x∈T}. We obtained these clusters and distances using a KD-tree, so the total cost of the construction and search of the nearest neighbors is O(N log(N)) distance computations [30]. The cost of one distance computation in X is O(d)(c_add + c_mul). As summarized in Algorithm 3, the generation of the perturbed set T̃ is done only once. The cost of generating T̃ is O(dN) c_add.
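The shared quantities described above, namely the clusters N_M(x) and the nearest-neighbor distances, can be sketched with a brute-force search. This is an illustration only: our experiments used a KD-tree, whereas the code below sorts all pairwise distances for readability, and it assumes that the training set contains no duplicated samples. The function name is hypothetical.

```python
import math

def nearest_neighbor_clusters(samples, M):
    """Return, for each sample, the indexes of its cluster and its nearest distance.

    Each cluster holds the sample itself (distance 0) and its M - 1 nearest
    neighbors. Brute force: every sample is compared with every other one.
    """
    clusters, nearest_dist = [], []
    for x in samples:
        order = sorted(range(len(samples)), key=lambda j: math.dist(x, samples[j]))
        clusters.append(order[:M])
        nearest_dist.append(math.dist(x, samples[order[1]]))
    return clusters, nearest_dist

# Toy one-dimensional layout embedded in two dimensions.
samples = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
clusters, nearest_dist = nearest_neighbor_clusters(samples, M=2)
```

Both outputs depend only on the training set, which is why they can be computed once and reused for every candidate classifier status.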
K-means clustering costs KN distance computations at each K-means iteration. A distance computation costs O(d)(c add +c mul ). We denote by i K the imposed maximum number of iterations of K-means. If we assume that each class contains roughly the same number of training samples N/J , then the total time costs of obtaining our trained PBC in Algorithm 2 are O(i K dN)(c add + c mul ). We repeat these costs as many times as we attempt different values for the number of prototypes per class. We denote by R K the number of tried values of K. The preliminary time costs of Proposal 2 are thus c (2)

Estimation of c^(2)(Λ)
Given one candidate classifier status Λ, finding the top indexes i(x; Λ), j(x; Λ) for the N training samples using QuickSort requires O(NJ log(J)) comparisons. Then, the cost of the set of operations defined by Eqs. 27, 32, and 38 is independent of J and d. As stated above, it only requires simple arithmetic operations (e.g., mean on M samples, multiplications, application of the exponential function) on a two-dimensional array of size at most (N, M), whose rows each correspond to a training sample x ∈ T and whose columns each correspond to a single perturbed element of N_M(x). The cost of computing near-boundary-ness measurement values on T̃ is the cost of classifying T̃. c^(2)(Λ) = c^(2)_add + c^(2)_mul + c^(2)_comp + c^(2)_exp + c

Time Costs of Proposal 1
We denote by c (1) (Λ) the time costs to estimate U (Λ) for a single candidate classifier status Λ. We denote by c (1,1) (Λ) and c (1,2) (Λ) the time costs of Step 1 and Step 2 of Proposal 1, respectively. We detail these costs in the following sections. Step 1 and Step 2 are executed independently, hence c (1)

Time Costs of Step 1 in Proposal 1
Here, we take up the notations from Section 3.3. The cost of Step 1 is the cost of generating R A anchors and then searching for their nearest neighbor. Given a random segment [x; x ], the search for an anchor between x and x is executed by dichotomy, whose maximum number of iterations we denote by i max . Each dichotomy iteration generates one artificial sample. Given one such artificial sample that we denote by c (1,1) (Λ) = c (1) add + c (1) mul + c (1) comp + c (1) where c (1,1)

Time Costs of Step 2 in Proposal 1
We now take up the notations from Section 3.4. We assume for simplification that at the end of the TDC, each cluster contains N_max samples. During the first step of a given TDC, the main cost of the 2-means clustering is to compute the distance between each of the N samples and the two cluster centroids. This results in 2N distance computations. There are i_K iterations in 2-means clustering, so the cost of the 2-means clustering on N samples is O(2 i_K dN)(c_add + c_mul).
During the second step of a given TDC, 2-means clustering is applied to each of the two clusters that consist of N/2 samples. This corresponds to computing 2 × (N/2) distances twice, namely 2N distance computations overall. More generally, we note that 2N distances are computed regardless of the dividing step in a given TDC, and thus the cost of each TDC step is always O(2 i_K dN)(c_add + c_mul).
To obtain the total cost of a given TDC, we estimate the number of TDC steps, which we denote by i_T for convenience. Assuming clusters of roughly equal size within each step, i_T satisfies N/2^(i_T) = N_max, and thus i_T = log(N/N_max) + 1. The cost of Step 2 of Proposal 1 is therefore c^(1,2)(Λ) = c^(1,2)_add + c^(1,2)_mul, where The

Experiments
The goal of our experiments is three-fold. Goal (a): Assess whether Proposal 2 accurately estimates the boundary uncertainty defined by Eq. 15 even without costly traditional estimation methods such as re-sampling. Goal (b): Confirm once more that boundary uncertainty is a relevant quantity to perform classifier evaluation and classifier selection. Goal (c): Compare Proposal 2 in terms of the accuracy of classifier selection and scalability with existing widely applicable and powerful methods (HO and CV), but also with Proposal 1.

Classifiers
In this early stage of our research, we focus on the careful design and analysis of boundary uncertainty estimation and thus restrict our experiments to the selection of a single hyperparameter.
We assessed Proposal 2 in the evaluation and then selection of classifier statuses of the Gaussian kernel SVM classifier for two reasons. First, this classifier has only two hyperparameters: the Gaussian kernel width and the regularization coefficient. Second, we can easily obtain extreme cases of insufficient or excessive representation capability by controlling the values of these two hyperparameters, owing to the infinite VC capacity of Gaussian kernel SVMs and to the possibility of analytically obtaining the global minimum of training objectives of SVMs [27]. Gaussian kernel SVMs thus provide a simple way of analyzing Proposal 2 over a wide range of classifier boundary cases. We varied the hyperparameter values by powers of 2 in order to easily sweep a large range of values, while avoiding an excessively rough search (as powers of 10 may result in), e.g., 2 −15 , 2 −14 , · · · , 2 14 , 2 15 .
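The sweep over powers of 2 described above can be generated in one line. The exponent range is the example range quoted in the text; the variable names are ours.

```python
# Candidate hyperparameter values 2^-15, 2^-14, ..., 2^14, 2^15.
exponents = range(-15, 16)
gammas = [2.0 ** e for e in exponents]

# For comparison: powers of 10 covering a similar span give far fewer points.
coarse = [10.0 ** e for e in range(-4, 5)]
```

The grid in base 2 yields 31 candidate values over roughly nine decades, against 9 values for the grid in base 10, which is the sense in which powers of 10 give a rougher search.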
To simplify the analysis, we preliminarily selected the regularization coefficient using CV [18]. Our experiments focused on the selection of the Gaussian kernel width. Following the notations of the SVM implementation that we used, 5 we denote by γ the inverse of the Gaussian kernel width. A higher γ corresponds to a higher capacity of the classifier to draw a more complex B(Λ). For each value of γ , we performed full SVM training, and then we evaluated the resulting classifier status with Proposal 2, HO, and CV.
To check that Proposal 2 can also evaluate other classifier models, we also briefly assessed Proposal 2 on a MultiLayer Perceptron (MLP). The goal in our experimental setting was not to achieve state-of-the-art performance but rather to analyze Proposal 2 on an MLP. Therefore, we simply considered an MLP with two hidden layers of 128 and 64 units, respectively. Both layers use ReLU activation functions. The output layer uses cross entropy as the objective function, which we optimized with RMSProp, using a library available online. 6 As an MLP selection experiment, we performed an early stopping experiment; namely, we looked for the optimal number of training epochs within a single instance of classifier training.

Datasets
We performed our experiments on real-world benchmark datasets available online 7 that are quite small and basic but that provide some diversity in terms of dimensionality, number of samples, nature of the features, and class overlap. Furthermore, we prepared three synthetic two-dimensional and two-class datasets using Gaussian Mixture Models: GMM, GMM separable, and GMM inclusion . We illustrate these datasets in Figs. 10, 14, and 15, respectively. For the GMM dataset, we generated a testing set from the same mother distribution as T to provide more exhaustive experimental results. Owing to its larger number of available samples, we could also afford to split the Letter Recognition dataset into a training set and a testing set of equal size.

Data Preparation
On all of the datasets, we standardized independently each feature by removing the mean and then scaling to unit variance across the entire training set. 8 For the datasets that also have a testing set, we applied the same standardization to the testing set, based on the means and variances that were measured from the training set.
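The standardization step can be sketched as follows: statistics are measured on the training set only and then reused for the testing set. The use of the population standard deviation and the guard against constant features are assumptions of this example, and the function names are hypothetical.

```python
import statistics

def fit_standardizer(rows):
    """Measure the per-feature mean and standard deviation on the training rows."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    # Constant features would give a zero deviation; keep them unscaled.
    stds = [statistics.pstdev(c) or 1.0 for c in columns]
    return means, stds

def standardize(rows, means, stds):
    """Remove the mean and scale to unit variance, feature by feature."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

train = [[1.0, 10.0], [3.0, 30.0]]
test = [[2.0, 40.0]]
means, stds = fit_standardizer(train)   # statistics come from the training set only
train_std = standardize(train, means, stds)
test_std = standardize(test, means, stds)
```

Reusing the training statistics keeps the testing set from influencing the preprocessing.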
To obtain T̃, we applied Algorithm 1 to the standardized data, using the nearest distances ||x − x^(1)|| that were measured on the standardized data.
The two hyperparameters of Proposal 2 are M and K, which appear in Sections 4.4.1 and 4.4.3, respectively. For all of the datasets, we set M to 40 because a meaningful computation of Eq. 32 seems to require at least 30-40 samples. In all of the datasets, we searched for K in the range [1,40] with a step of 2, and then we selected its value as described in the Appendix.

Exhaustive Results on Synthetic Data
In Fig. 6, we display the SVM selection results for the GMM dataset. There are two ideal alternatives for assessing Goal (a) (for our three goals, see the beginning of Section 6). One alternative is to visualize the similarity between B* and B(Λ) in the multidimensional sample space; however, this is not practical. Another alternative is to use a large testing set (not to be confused with the validation sets that are used by HO and CV) as a reference truth that gives estimates as close as possible to the expected values. In this case, for each Λ that is preliminarily obtained by training the classifier on T, we obtain Û_tr(Λ) by using T as input of Algorithm 3, and then we separately obtain Û_te(Λ) by using the testing set as input of Algorithm 3. If Proposal 2 is an accurate boundary uncertainty estimation method, then Proposal 2 should satisfy ∀Λ, Û_tr(Λ) ≈ Û_te(Λ).
In Panel 3 of Fig. 6, we plot −Û_tr(Λ) (blue) and −Û_te(Λ) (black) against γ. We can see that the blue and black curves in Fig. 6 are nearly identical. This seems to validate the accurate estimation of the boundary uncertainty by Proposal 2 without relying on resampling methods, and thus it achieves Goal (a). We note that the boundary uncertainty estimation in Proposal 2 essentially relies on Eqs. 27 and 32. Accurate and scalable estimation of ratios for multidimensional data, such as in the kNN regression, usually requires advanced estimation methods [28]. In our case, the accuracy shown in Fig. 6 for the ratio formed by Eqs. 27 and 28 may be explained by the local reduction of the boundary uncertainty task to the single dimension formed by the near-boundary-ness measurement. This local reduction enabled the use of analytic estimation rules such as Eq. 32, which may be sufficiently accurate on one-dimensional data while requiring only training data.
We can assess Goal (b) by comparing the behavior of −U(Λ) with the behavior of well-established classifier evaluation metrics such as error probability. A better classifier status corresponds to a lower error probability and to a lower sign-reversed boundary uncertainty −U(Λ). We can thus expect the two evaluation metrics to mutually confirm their validity by following the same trends, and especially by hitting a minimum for the same Λ (hopefully the Bayes error status). Incidentally, we can assess Goal (a) by checking whether −U(Λ) reaches its minimum value −1 when the minimum error probability is achieved. We thus check −Û_te(Λ) and L̂_te(Λ) against γ for the GMM dataset in Panel 1 of Fig. 7, and we see that Goal (b) is achieved.
In Panel 1, −Û_te(Λ) actually shows a sharper minimum than L̂_te(Λ). This sharper trend can be explained by the focus of boundary uncertainty precisely on the classifier boundary: overall similar error probabilities can correspond to quite different classifier boundaries. Boundary uncertainty-based classifier evaluation may discriminate between classifier statuses more finely than error probability, provided that the estimation method is accurate enough. This implies that even though we can expect some similar trend between −Û_te(Λ) and L̂_te(Λ), reproducing exactly the same trend is not necessarily the goal of boundary uncertainty. We emphasize that L̂_te(Λ) is simply used as a default informative reference. This is the reason why we do not try to quantitatively measure the correlation between −Û_te(Λ) and L̂_te(Λ), since this might imply that our goal is to fit L̂_te(Λ) with −Û_te(Λ).
In practice, we may not always have access to a testing set. Instead, CV may be used to estimate the error probability. Therefore, we also assessed CV-based estimates of the classification error probability. Our experiments adopted stratified CV 9 with a high number of folds (10 to 40) to get closer to LOO, whose asymptotic unbiasedness is proven. We also tried HO, by viewing HO as a more practical competitor of Proposal 2 than CV in terms of computation cost. For HO, we applied a ratio (75%, 25%) for the split (train, validation).
We illustrate CV and HO in Panel 2 of Fig. 6: average L̂_tr(Λ) of the error probability estimated on the training folds of CV (orange); average L̂_val(Λ) of the error probability estimated on the validation folds of CV (green); error probability estimated on a holdout validation set, L̂_HO(Λ) (dotted black); error probability estimated on a large testing set, L̂_te(Λ) (red); sign-reversed boundary uncertainty estimated on the training set, −Û_tr(Λ) (blue); and sign-reversed boundary uncertainty estimated on a large testing set, −Û_te(Λ) (black).

Results on Real-Life Data
For the Letter Recognition dataset, we have a testing set, so we separately display more detailed results in Fig. 7, which follows the same layout as the upper row of Fig. 6. We note how −Û_tr(Λ) and −Û_te(Λ) are strikingly close in Panel 3, which seems to confirm that Proposal 2 can accurately estimate boundary uncertainty. This contrasts once more with the impossibility of estimating L̂_te(Λ) simply with L̂_tr(Λ) (orange curve in Panel 2).
We observe that the minimum value of L̂_val(Λ) on the Letter Recognition dataset is nearly zero, which implies that this dataset may be easy to classify and that performance close to the Bayes risk can be achieved at this minimum value of L̂_val(Λ). The minima of L̂_val(Λ) and −Û_tr(Λ) coincide, but the minimum value of −Û_tr(Λ) is not −1, as it would be if the classifier executed B*. If we assume that Proposal 2 accurately estimates boundary uncertainty (based on observations from Panel 3), then it would be reasonable to assume that a classifier status could be nearly optimal in terms of classification error probability while its boundary uncertainty does not appear optimal, as long as B(Λ) does not get closer to the optimal boundary. In other words, boundary uncertainty may evaluate the generalization ability more strictly than the classification error probability.
For the other real-life datasets, we did not have access to a large testing set. In this case, the best available estimates of the classification error probability and of −U(Λ) are L̂_val(Λ) and −Û_tr(Λ), respectively. 10 We apply the same checks as in Section 6.4.1 but replace L̂_te(Λ) and −Û_te(Λ) with L̂_val(Λ) and −Û_tr(Λ), respectively. We expect similar trends and minimum values of L̂_val(Λ) and −Û_tr(Λ) against the classifier status.
For comparison, we also consider −Û_tr(Λ) obtained by Proposal 1 with Eq. 11. The slightly different trends of L̂_val(Λ) between Proposal 1 and Proposal 2 may be due to slightly different splittings between our former experiments of Proposal 1 [18] and the experiments in this paper. Owing to the range of values of the binary Shannon entropy in Proposal 1, we rescaled −Û_tr(Λ) in the results of Proposal 1 by a factor ln(2) so that the range of values becomes [0, 1] as in Proposal 2. In Proposal 2, the minimum values of −Û_tr(Λ) are closer to −1 for each dataset, which achieves Goal (a). Figure 8 shows that for most of the datasets, Proposal 2 matches the benchmark CV in terms of trends and minimum, which achieves Goal (b). Additionally, the trends of −Û_tr(Λ) in Proposal 2 provide a sharper minimum than either their counterpart in Proposal 1 or L̂_val(Λ). This higher ability to sharply select an optimal classifier status achieves Goal (c). Figure 9 shows the MLP selection results for the Letter Recognition dataset. The layout is the same as in Fig. 7, but in this case the horizontal axis corresponds to the training epoch. Once more, Proposal 2 seems to achieve the three goals we described in Section 6.4.1 for the GMM dataset.

Illustration of Covariate Shift in Near-Boundary-Ness Measurement Space
In order to illustrate the covariate shift in the near-boundary-ness measurement space f(·; Λ), as well as the usefulness of T̃ covered in Section 4.2.2, we selected an SVM classifier status Λ that has a rather high representation capability. We then plotted in Fig. 10 the three distributions f(T; Λ), f(T̃; Λ), f(X_te; Λ) in the

Effect of T̃ in Boundary Uncertainty Estimation
To show the effect of T̃, we display in Fig. 11 and in Panels 1bis and 3bis of Fig. 7 the results obtained by Proposal 2 when naively computing the values of f(·; Λ) on the unperturbed set T. For almost all datasets, we observe that −Û_tr(Λ) assigns a favorable uncertainty value even to excessively high values of γ that are clearly not optimal (high L̂_val(Λ), as indicated on the left vertical axis). Such clear degradation of the accuracy of −Û_tr(Λ) for higher values of γ confirms the usefulness of T̃. For the Spambase dataset, we see a noisier trend of −Û_tr(Λ) in Fig. 8 compared to Fig. 11. This may call for refinements of the definition of T̃ (Algorithm 1) in order to more "naturally" perturb T.

Influence of Hyperparameter M on Û^(3)(Λ)
To check that Proposal 2 is not sensitive to the setting of M, we performed the above SVM selection experiments not only for M = 40 but also for a wide range of values: M = 20, 40, 80, 120, 160. We show the corresponding results in Fig. 12 for the GMM and Wine Quality White datasets. The minimum and trend of Û^(3)(Λ) seem quite insensitive to M, although the value of Û^(3)(Λ) itself may change slightly. This insensitivity was observed for all datasets, so we only display the results for two datasets. Despite this apparent insensitivity, we prefer the relatively small value M = 40 in order to have just enough samples for the estimation described by Eq. 27, while focusing locally along B(Λ).

Influence of Hyperparameter K on Û^(3)(Λ)
In order to perform the classifier selection, as Section 4.4.3 implies, Proposal 2 requires that model selection be performed on a generative classifier model, using another model selection criterion (AIC in our case). This requirement may seem to defeat the purpose of our proposal, both conceptually and in terms of time costs. However, we show that this requirement is not a bottleneck but rather a rough initialization for Proposal 2.
First, the time costs incurred by the class-by-class search of the optimal number of prototypes are not high, as described in Section 6.10. Second, the results are actually quite insensitive to the setting of each K_j. Figure 13 illustrates the classifier evaluation results of Proposal 2 obtained by imposing different values of the number of prototypes per class, which we set the same for all classes for simplicity and which we denote by K. We tried the values K = 3, 10, 20, 40. We also display the results obtained by setting each K_j using AIC as described in Algorithm 2. I* has no influence on two-class datasets, so we only display results for multiclass datasets. We only show two datasets, since results on other datasets show the same phenomenon. We set M = 40 in this paragraph. Figure 13 shows that results are almost the same despite the quite broad range of values for K. This insensitivity may be the result of I* only requiring the determination of the indexes of the two highest class posterior probabilities, instead of their actual values.

Handling of Datasets with Well-Separated Classes
This section illustrates the usefulness of the branching treatment proposed by Eq. 39 on a synthetic dataset that features well-separable classes, which we call GMM separable. Figure 14 shows details of the SVM evaluation results obtained on the GMM separable dataset.
For this dataset, B * obviously lies in the middle of the empty region that separates the two classes, as also illustrated in Fig. 5. For four classifier statuses γ 1 to γ 4 , we represent the location of B(Λ) by plotting the values of the probability weights k(·; Λ)/w 12 (Λ) in a colormap. Training samples that are close to B(Λ) are represented in cyan, and those even closer to B(Λ) are represented in pink.
The colormap shows that γ 1 results in a biased B(Λ) that diagonally crosses the two classes, instead of passing between them. γ 2 results in a better B(Λ), although the classifier appears quite uncertain as B(Λ) seems to spread widely around the empty region (samples in cyan). γ 3 executes the Bayes boundary, namely B(Λ) ≈ B * : In this case, B(Λ) lies in the center of the empty region between the class distributions, and it is far from either class distribution, so no training sample appears in cyan or pink. γ 4 is too high: The resulting B(Λ) unnecessarily encircles both class distributions, which shows overfitting to the training set.
While all of γ 1 , γ 2 , γ 3 , γ 4 are assigned a quite favorable score L val (Λ), Proposal 2 can discriminate quite sharply between a classifier that truly executes B * , and a classifier that is actually quite far from executing B * , despite apparently low classification error on the finite data at hand. This sharp evaluation ability of Proposal 2 seems to be the result of the sharp estimation focus on B(Λ) and B * , combined with the perturbation used when evaluating the near-boundary-ness measurement that efficiently reacts to overfitting.

Experimental Considerations of Equivalence between U max and B *
The equivalence between achieving U max and achieving the Bayes decision rule is a key component of boundary uncertainty. Therefore, we further discuss the equivalences introduced in Section 4.5.
Equivalences (21) and (22) are obvious, based on the definitions of B(Λ), B * and U (Λ). Equivalence (24) (i.e., equivalence between B * and Bayes decision rule) is also quite straightforward. There may be one counter-example to Equivalence (24): Given a two-class task, we may have B(Λ) = B * , while all of the class labels may be mistaken. This counter-example is highly unlikely in practice, and even more unlikely in the case of multiclass classification.
We now elaborate on Equivalence (23). Strictly speaking, maximum boundary uncertainty measures the inclusion B(Λ) ⊂ B * . Here, the main point is that the perfect inclusion B(Λ) ⊂ B * is unlikely, and that even a nearly included B(Λ) is likely to have bias (un-Bayes-boundaryness) that can be detected with an accurate boundary uncertainty estimation. We first illustrate the inclusion issue on two-class data using the GMM inclusion dataset (top-left corner of Fig. 15).
We generated the GMM inclusion dataset so that B * consists of two fragments, denoted by B * (1) and B * (2) , respectively. B * (1) is a gentle curve between the blue (left) and red (right) sample crowds. B * (2) is an ellipse surrounding the blue dense crowd at the right side. We plot the classifier selection results by Proposal 2 on the GMM inclusion dataset with a Gaussian kernel SVM following the conventions used in Fig. 8. Λ A and Λ B are two trained classifier statuses that were obtained from γ A and γ B , respectively. Panels A and B visualize the classifier decision that corresponds to Λ A and Λ B , respectively, by showing the training samples with their predicted class labels (red and blue dots) as well as some estimated anchors (black dots).
B(Λ A ) is nearly included in B * : B(Λ A ) ≈ B * (1) , but B(Λ A ) "omitted" B * (2) . We can explain this by the insufficient representation capability of the SVM for Λ A . In contrast, B(Λ B ) is nearly equal to B * . Given the near inclusion of B(Λ A ), one would expect Û(Λ A ) ≈ U max , even though B(Λ A ) is clearly not optimal. In practice, however, the graph in Fig. 15 shows that Proposal 2 quite clearly outputs Û(Λ B ) ≥ Û(Λ A ) despite the near inclusion, owing to the unavoidable (although admittedly small) bias of B(Λ A ) when trying to reproduce B * (1) with insufficient representation capability.
Assuming that B(Λ A ) was even closer to inclusion in B * , there may be a simple way of detecting the obvious non-optimality of B(Λ A ). To fix the ideas, we represent in the rectangular box below Panel A the schematic histograms of the distributions {f (x; Λ A )} x∈C 1 and {f (x; Λ A )} x∈C 2 in blue and red, respectively. We note that f (x; Λ A ) = g 1 (x; Λ A ) − g 2 (x; Λ A ). Misclassifications are represented by the yellow areas under the histograms. We observe that the omission by B(Λ A ) of the entire fragment of Bayes boundary B * (2) results in a high amount of misclassification quite far away from f (·; Λ A ) = 0. Normally, however, the number of misclassifications becomes smaller and smaller as we go further away from B(Λ). In order to safely avoid the erroneous selection of classifier statuses with excessively low representation capability, investigating the detection and incorporation of such a suspicious burst of misclassifications far from B(Λ) in the computation of U (Λ) remains a possible future work.
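As a rough illustration of the histogram described above, the following Python sketch counts misclassified samples in bins of |f (x; Λ)| and flags a bin far from the boundary that holds more errors than the bin nearest to it. This is not part of Proposal 2, whose treatment of such bursts is left as future work. The function names, the label encoding (+1 for C 1 , −1 for C 2 ), the equal-width binning, and the flagging rule are all assumptions made for the example.

```python
def misclassifications_by_margin(f_values, labels, n_bins=5):
    """Count misclassified samples in equal-width bins of |f|.

    Assumed encoding: label +1 for class C1, -1 for class C2, and a
    sample is predicted as C1 when f > 0 (f = g1 - g2).
    """
    max_abs = max(abs(f) for f in f_values) or 1.0
    counts = [0] * n_bins
    for f, y in zip(f_values, labels):
        if f * y < 0:  # predicted class disagrees with the label
            b = min(int(abs(f) / max_abs * n_bins), n_bins - 1)
            counts[b] += 1
    return counts


def has_far_burst(counts):
    """True if some bin away from the boundary has more errors than bin 0."""
    return any(c > counts[0] for c in counts[1:])


# Toy values: one error near f = 0 and three errors far from it, mimicking
# an omitted fragment of the Bayes boundary.
f_values = [0.1, -0.2, 1.0, -1.5, 1.9, 2.0, 2.1]
labels = [-1, -1, 1, -1, -1, -1, -1]
counts = misclassifications_by_margin(f_values, labels)
suspicious = has_far_burst(counts)
```

A monotonically decreasing error count, such as `[3, 2, 1, 0, 0]`, would not be flagged by this simple rule, whereas the toy data above would.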

Time Cost Comparison between Proposal 2 and Proposal 1
This section quantifies the speed improvement of Proposal 2 compared to Proposal 1. For several real-life datasets used in our experiments and for any given classifier status Λ, Table 3 displays the multiplicative gain of Proposal 2 over Proposal 1 in regard to each elementary cost introduced in Section 5 when estimating the boundary uncertainty. For example, G add (Λ) = c (1) add (Λ)/c (2) add (Λ). We only consider the time costs that are exclusive to the evaluation of a classifier status. We could not compute a gain in regard to c exp because this elementary cost was not present in Proposal 1.
To compute the values in Table 3, we optimistically assumed that R A ∼ N, although for higher-dimensional datasets, considerably higher values of R A seemed necessary (i.e., Step 1 of Proposal 1 kept selecting different training samples even as the number of generated anchors increased well above N). We set the following values in Proposal 1 and Proposal 2: i max = 20, R T = 10, i K = 20, M = 40.
The values of d, J, N depend on the dataset, and they are given in Table 2. Table 3 shows that even for the relatively small datasets used in our experiments, Proposal 2 is 100 to 1000 times faster than Proposal 1, and this gain increases with d, J, N (Section 5).

Time Cost Comparison between Proposal 2 and CV
The time costs of SVM training are between O(N^2) and O(N^3) [31]. Along with the cost of classifying the validation folds, such training costs are repeated as many times as there are validation folds in CV. To obtain an

Discussion and Conclusion
In this paper, we formalized a new boundary uncertainty estimation method that provides more accurate, applicable, and scalable classifier evaluation (no anchor generation, no random repetitions, no unclear settings). The new proposal also clarified the two main reasons why boundary uncertainty can be accurately estimated based on a finite amount of training data without costly methods such as Cross Validation (CV). First, a tight focus relative to the classifier boundary implies that there is locally a single dimension of interest in the boundary uncertainty estimation task: such an estimation task is fundamentally easier and can be accurately performed even analytically. Second, both the classifier boundary and the Bayes boundary are identified by known conditions that can be accurately approximated from the training data. Performing classifier evaluation in a single shot without averaging, as done in CV, may in a sense improve interpretability: "what we got is what we evaluated." This contrasts with CV, where the final model is the result of separately re-training on the entire available dataset, and thus differs from each of the models averaged over the training folds.
Our approach to classifier evaluation starts from the statistical assumption that there is class overlap around the Bayes boundary. In practice, the feature-extraction step before classification aims to generate well-separated classes where no sample is close to the Bayes boundary (Bayes risk is equal to zero). In this regard, our approach may seem paradoxical. Nevertheless, in practice, the data may be high-dimensional and difficult to separate, implicitly containing irreducible misclassification. Traditional approaches based on classification error tend to easily overfit such difficult data, and the design of more and more powerful classifier models may also make them prone to overfitting unless adequate classifier evaluation is carried out. Our proposed approach may be especially useful in such settings. Meanwhile, our proposal can also handle simple cases with well-separated classes.
So far, we have focused our efforts on designing the concept of boundary uncertainty and obtaining reasonably accurate, scalable and applicable results. However, we have not yet obtained results that quantitatively show a definite gain in generalization ability compared to traditional approaches in state-of-the-art classification tasks. For now, boundary uncertainty can be used to determine whether the Bayes risk is achieved, and to design classifiers directly based on the training data. Boundary uncertainty actually defines another metric for generalization ability that could even be used as a reference to describe the performance on a testing set. Future steps will aim at careful consideration of how to evaluate boundary uncertainty itself, and how to make the most of it for accurate classification.
Other future steps for boundary uncertainty-based classifier evaluation include strengthening the equivalence between maximum boundary uncertainty and Bayes decision rule; deepening the formalization of boundary uncertainty and performing quantitative analysis of our classifier evaluation method, for example by establishing the rate of convergence of our boundary uncertainty estimator; and investigating applications to challenging tasks, including large-scale tasks where many parameters and hyperparameters are simultaneously optimized.

Appendix A: Class-Wise Application of AIC to Determine the Number of Prototypes per Class
We consider a class index $j \in [[1, J]]$. We denote by $T_j$ the set of training samples with given class label $C_j$ and by $M_j$ the set of prototypes that represents class $C_j$, namely $M_j = \{p_j^k\}_{k \in [[1, K_j]]}$. Given a training sample $x \in T_j$, the likelihood of $x$ can be calculated by assigning $x$ to the mixture component of class $C_j$ that has the highest probability, namely to the prototype of class $C_j$ that is closest to $x$. We denote this prototype by $p_j(x)$. The likelihood of $x$, $P(x \mid M_j, \sigma_j^2)$, is thus computed from $p_j(x)$ alone, where we take the variance $\sigma_j^2$ to be the within-cluster variance. The likelihood of $T_j$ is $P(T_j \mid M_j, \sigma_j^2) = \prod_{x \in T_j} P(x \mid M_j, \sigma_j^2)$.
Finally, the AIC score for class $C_j$ represented with the number of prototypes $K_j$ is $AIC(K_j) = \log P(T_j \mid M_j, \sigma_j^2) - (K_j d + 1)$.
We select the value of $K_j$ that yields the highest value of $AIC(K_j)$.
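The class-wise selection above can be sketched in Python as follows. This is a minimal sketch under several assumptions that are not confirmed by the text: the prototypes are obtained with plain k-means (Lloyd iterations with a deterministic start), each sample's likelihood is an isotropic Gaussian centered at its nearest prototype, and the within-cluster variance is the sum of squared distances divided by (number of samples × d). The names `kmeans`, `aic_score`, and `select_k` are hypothetical.

```python
import math


def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(samples, k, n_iter=20):
    """Plain Lloyd iterations; prototype training method is an assumption."""
    step = max(1, len(samples) // k)
    protos = [list(samples[i * step]) for i in range(k)]
    for _ in range(n_iter):
        groups = [[] for _ in protos]
        for x in samples:
            idx = min(range(k), key=lambda i: sq_dist(x, protos[i]))
            groups[idx].append(x)
        for i, g in enumerate(groups):
            if g:  # keep the old prototype if its cluster is empty
                protos[i] = [sum(c) / len(g) for c in zip(*g)]
    return protos


def aic_score(samples, protos):
    """log-likelihood minus (K * d + 1), with an assumed Gaussian likelihood."""
    n, d, k = len(samples), len(samples[0]), len(protos)
    # Squared distance of every sample to its nearest prototype p_j(x).
    sse = sum(min(sq_dist(x, p) for p in protos) for x in samples)
    var = max(sse / (n * d), 1e-12)  # assumed normalization, floored
    log_lik = -0.5 * n * d * math.log(2 * math.pi * var) - sse / (2 * var)
    return log_lik - (k * d + 1)


def select_k(samples, candidates):
    """Return the candidate K with the highest AIC score, and all scores."""
    scores = {k: aic_score(samples, kmeans(samples, k)) for k in candidates}
    return max(scores, key=scores.get), scores


# Toy class made of two tight, well-separated groups of samples.
class_samples = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (0.2, 0.2),
                 (10.0, 10.0), (10.2, 10.0), (10.0, 10.2), (10.2, 10.2)]
best_k, scores = select_k(class_samples, candidates=[1, 2])
```

On this toy class, two prototypes explain the samples far better than one, so the score for K = 2 is higher. Note that the variance floor makes the score degenerate when K approaches the number of samples, so the candidate range should stay well below the class size.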