Surrogate regret bounds for generalized classification performance metrics

We consider optimization of generalized performance metrics for binary classification by means of surrogate losses. We focus on a class of metrics that are linear-fractional functions of the false positive and false negative rates (examples include the $F_{\beta}$-measure, the Jaccard similarity coefficient, the AM measure, and many others). Our analysis concerns the following two-step procedure. First, a real-valued function $f$ is learned by minimizing a surrogate loss for binary classification on the training sample. It is assumed that the surrogate loss is a strongly proper composite loss function (examples include logistic loss, squared-error loss, exponential loss, etc.). Then, given $f$, a threshold $\hat{\theta}$ is tuned on a separate validation sample, by direct optimization of the target performance metric. We show that the regret of the resulting classifier (obtained from thresholding $f$ at $\hat{\theta}$) measured with respect to the target metric is upperbounded by the regret of $f$ measured with respect to the surrogate loss. Our finding is further analyzed in a computational study on both synthetic and real data sets.


Introduction
In binary classification, misclassification error is not necessarily an adequate evaluation metric, and one often resorts to more complex metrics, better suited to the problem. For instance, when the classes are imbalanced, the F_β-measure (Lewis, 1995; Jansche, 2005; Nan et al., 2012) and the AM measure (Menon et al., 2013) are frequently used. Optimizing such generalized performance metrics poses computational and statistical challenges, as they cannot be decomposed into losses on individual observations.
In this paper, we consider optimization of generalized performance metrics by means of surrogate losses. We restrict our attention to a family of performance metrics which are ratios of linear functions of the false positive (FP) and false negative (FN) rates. Such functions are called linear-fractional, and include the aforementioned F_β and AM measures, as well as the Jaccard similarity coefficient, weighted accuracy, and many others. We focus on the most popular approach to optimizing generalized performance metrics in practice, based on the following two-step procedure. First, a real-valued function f is learned by minimizing a surrogate loss for binary classification on the training sample. Then, given f, a threshold θ̂ is tuned on a separate validation sample, by direct optimization of the target performance metric with respect to a classifier obtained from f by thresholding at θ̂. This approach can be motivated by asymptotic analysis: minimization of an appropriate surrogate loss results in estimation of the conditional ("posterior") class probabilities, and many performance metrics are maximized by a classifier which predicts by thresholding on the scale of conditional probabilities (Nan et al., 2012; Zhao et al., 2013; Natarajan et al., 2014). However, it is unclear what can be said about the behavior of this procedure on finite samples.
In this paper, we are interested in a theoretical analysis and justification of this approach for any sample size, and for any, not necessarily perfect, classification function. To this end, we use the notion of regret with respect to some evaluation metric, which is the difference between the performance of a given classifier and the performance of the optimal classifier with respect to this metric. We show that the regret of the resulting classifier (obtained from thresholding f at θ̂) measured with respect to the target metric is upperbounded by the regret of f measured with respect to the surrogate loss. Our result holds for any surrogate loss which is a strongly proper composite loss function (Agarwal, 2014), examples of which include logistic loss, squared-error loss, exponential loss, etc. Interestingly, the proof of our result goes through an intermediate bound of the regret with respect to the target metric by a cost-sensitive classification regret. Our finding is further analyzed in a computational study on both synthetic and real data sets. To our knowledge, this is the first regret bound of this form applicable to generalized performance metrics in binary classification.
Related work. Existing theoretical work on generalized performance metrics is mainly concerned with statistical consistency, also known as calibration, which determines whether convergence to the minimizer of a surrogate loss implies convergence to the minimizer of the task performance metric as the sample size goes to infinity (Nan et al., 2012; Zhao et al., 2013; Narasimhan et al., 2014; Natarajan et al., 2014). Here we give a stronger result, which bounds the regret with respect to the performance metric by the regret with respect to the surrogate loss. Our result is valid for all finite sample sizes and informs about the rates of convergence. Parambath et al. (2014) present an alternative approach to maximizing linear-fractional metrics by learning a sequence of binary classification problems with varying misclassification costs. While we were inspired by their theoretical analysis, their approach is more complicated than the two-step approach analyzed here, which requires solving an ordinary binary classification problem only once. Moreover, as part of our proof, we show that by minimizing a strongly proper composite loss, we are implicitly minimizing the cost-sensitive classification error for any misclassification costs without any overhead. Hence, the costs need not be known during learning, and can be determined later on a separate validation sample by optimizing the threshold.
Outline. The paper is organized as follows. In Section 2 we introduce basic concepts, definitions and notation. The main result is presented in Section 3 and proved in Section 4. The theoretical contribution of the paper is complemented by computational experiments in Section 5, prior to concluding with a summary in Section 6.

Problem setting
Binary classifier. In binary classification, the goal is, given an input (feature vector) x ∈ X, to accurately predict the output (label) y ∈ {−1, 1}. We assume input-output pairs (x, y) are generated i.i.d. according to Pr(x, y). A classifier is a mapping h : X → {−1, 1}. See (Natarajan et al., 2014) for a more detailed description.
Given h, we define the following four quantities:

    TP(h) = Pr(h(x) = 1, y = 1),      FP(h) = Pr(h(x) = 1, y = −1),
    TN(h) = Pr(h(x) = −1, y = −1),    FN(h) = Pr(h(x) = −1, y = 1),

which are known as true positives, false positives, true negatives and false negatives, respectively. Note that for any h, FP(h) + TN(h) = Pr(y = −1) and TP(h) + FN(h) = Pr(y = 1), so out of the four quantities above, only two are independent. In this paper, we use the convention of parameterizing all measures by means of FP(h) and FN(h).
Generalized classification performance metrics. We call a two-argument function Ψ = Ψ(FP, FN) a (generalized) classification performance metric. Given a classifier h, we define Ψ(h) = Ψ(FP(h), FN(h)). Throughout the paper we assume that Ψ(FP, FN) is linear-fractional, i.e. a ratio of linear functions:

    Ψ(FP, FN) = (a_0 + a_1 FP + a_2 FN) / (b_0 + b_1 FP + b_2 FN),    (1)

where we allow the coefficients a_i, b_i to depend on the distribution Pr(x, y). 1 We also assume Ψ(FP, FN) is non-increasing in FP and non-increasing in FN, a property possessed by virtually all performance metrics used in practice. Given any classifier h, we define its Ψ-regret as the distance of h from the optimal classifier h*_Ψ = arg max_h Ψ(h), measured by means of Ψ:

    Reg_Ψ(h) = Ψ(h*_Ψ) − Ψ(h).

Strongly proper composite losses. Here we briefly outline the theory of strongly proper composite loss functions. See (Agarwal, 2014) for a more detailed description. Define a binary class probability estimation (CPE) loss function (Reid and Williamson, 2010, 2011) as a function c : {−1, 1} × [0, 1] → R_+, where c(y, η̂) assigns a penalty to prediction η̂ when the observed label is y. Define the conditional c-risk as: 3

    risk_c(η, η̂) = η c(1, η̂) + (1 − η) c(−1, η̂),

the expected loss of prediction η̂ when the label is drawn from a distribution with Pr(y = 1) = η. We say a CPE loss is proper if for any η ∈ [0, 1], η ∈ arg min_{η̂ ∈ [0,1]} risk_c(η, η̂). In other words, proper losses are minimized by taking the true class probability as the prediction; hence η̂ can be interpreted as a probability estimate of η. Define the conditional c-regret as:

    reg_c(η, η̂) = risk_c(η, η̂) − risk_c(η, η),

the difference between the conditional c-risk of η̂ and the optimal c-risk. We say a CPE loss c is λ-strongly proper if for any η, η̂:

    reg_c(η, η̂) ≥ (λ/2) (η − η̂)²,

i.e. the conditional c-regret is everywhere lowerbounded by a squared difference of its arguments. It can be shown (Agarwal, 2014) that under a mild regularity assumption a proper CPE loss c is λ-strongly proper if and only if the function H_c(η) := risk_c(η, η) is λ-strongly concave. This fact lets us easily verify whether a given loss function is λ-strongly proper.
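For concreteness, the metrics mentioned above can be written directly as linear-fractional functions of the FP and FN rates. The following minimal sketch (our illustration; the function names are not from the paper) expresses F_β, Jaccard and AM through FP, FN and the class prior π = Pr(y = 1), using TP = π − FN:

```python
def f_beta(FP, FN, pi, beta=1.0):
    # F_beta = (1 + beta^2) TP / ((1 + beta^2) TP + beta^2 FN + FP),
    # with TP = pi - FN; linear-fractional in (FP, FN).
    b2 = beta ** 2
    return (1 + b2) * (pi - FN) / ((1 + b2) * pi + FP - FN)

def jaccard(FP, FN, pi):
    # J = TP / (TP + FP + FN) = (pi - FN) / (pi + FP).
    return (pi - FN) / (pi + FP)

def am(FP, FN, pi):
    # AM = (TPR + TNR) / 2: mean of true positive and true negative rates.
    return ((pi - FN) / pi + (1 - pi - FP) / (1 - pi)) / 2

# A perfect classifier (FP = FN = 0) attains the maximum value 1 of each metric.
for metric in (f_beta, jaccard, am):
    assert abs(metric(0.0, 0.0, pi=0.3) - 1.0) < 1e-12
```

All three are non-increasing in FP and FN, as assumed above.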
If c is λ-strongly proper and ψ : [0, 1] → R is a strictly increasing (hence invertible) link function, we call the function ℓ : {−1, 1} × R → R_+ defined as

    ℓ(y, f) = c(y, ψ⁻¹(f))

a λ-strongly proper composite loss function. The notions of conditional ℓ-risk risk_ℓ(η, f) and conditional ℓ-regret reg_ℓ(η, f) extend naturally to the case of composite losses:

    risk_ℓ(η, f) = risk_c(η, ψ⁻¹(f)),    reg_ℓ(η, f) = reg_c(η, ψ⁻¹(f)),

and the strong properness of the underlying CPE loss implies:

    reg_ℓ(η, f) ≥ (λ/2) (η − ψ⁻¹(f))².

3. Throughout the paper, we follow the convention that all conditional quantities are lowercase (regret, risk), while all unconditional quantities are uppercase (Regret, Risk).
Table 2: Three popular strongly proper composite losses: squared-error, logistic and exponential. Shown are the composite loss ℓ(y, f), the link function ψ(η̂), and the strong properness constant λ; the underlying CPE loss is c(y, η̂) = ℓ(y, ψ(η̂)). See (Agarwal, 2014) for more details and examples.

    loss function    ℓ(y, f)            ψ(η̂)                     λ
    squared-error    (y − f)²           2η̂ − 1                   8
    logistic         ln(1 + e^{−yf})    ln(η̂ / (1 − η̂))          4
    exponential      e^{−yf}            (1/2) ln(η̂ / (1 − η̂))    4
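The constants λ in Table 2 can be checked numerically against the defining inequality reg_c(η, η̂) ≥ (λ/2)(η − η̂)². Below is a small sketch (our illustration, not the paper's code): the conditional risks follow the standard CPE-loss formulas for these three losses.

```python
import numpy as np

# Conditional c-risk risk_c(eta, p) = eta * c(1, p) + (1 - eta) * c(-1, p)
# for the CPE losses underlying the three composite losses of Table 2.
def risk_squared(eta, p):
    return eta * (1 - (2 * p - 1)) ** 2 + (1 - eta) * (-1 - (2 * p - 1)) ** 2

def risk_logistic(eta, p):
    return -eta * np.log(p) - (1 - eta) * np.log(1 - p)

def risk_exponential(eta, p):
    return eta * np.sqrt((1 - p) / p) + (1 - eta) * np.sqrt(p / (1 - p))

grid = np.linspace(0.01, 0.99, 50)
for risk, lam in [(risk_squared, 8), (risk_logistic, 4), (risk_exponential, 4)]:
    for eta in grid:
        for p in grid:
            reg = risk(eta, p) - risk(eta, eta)   # conditional c-regret
            assert reg >= lam / 2 * (eta - p) ** 2 - 1e-9
```

For squared-error the inequality holds with equality (reg_c(η, η̂) = 4(η − η̂)²), which shows λ = 8 is tight; for logistic loss the inequality is Pinsker's inequality for the binary KL divergence.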

Main result
Given a real-valued function f : X → R and a λ-strongly proper composite loss ℓ(y, f), define the ℓ-risk of f as the expected loss of f(x) with respect to the data distribution:

    Risk_ℓ(f) = E_{(x,y)}[ℓ(y, f(x))] = E_x[η(x) ℓ(1, f(x)) + (1 − η(x)) ℓ(−1, f(x))],

4. ⟦Q⟧ is the indicator function, equal to 1 if Q holds, and to 0 otherwise.
where η(x) = Pr(y = 1|x). Let f* be the minimizer of Risk_ℓ(f) over all functions, f* = arg min_f Risk_ℓ(f). Since ℓ is proper composite, f*(x) = ψ(η(x)). Define the ℓ-regret of f as:

    Reg_ℓ(f) = Risk_ℓ(f) − Risk_ℓ(f*).

Any real-valued function f : X → R can be turned into a classifier h_{f,θ} : X → {−1, 1} by thresholding at some value θ:

    h_{f,θ}(x) = sgn(f(x) − θ).

The purpose of this paper is to address the following problem: given a function f with ℓ-regret Reg_ℓ(f) and a threshold θ, what can we say about the Ψ-regret of h_{f,θ}? For instance, can we bound Reg_Ψ(h_{f,θ}) in terms of Reg_ℓ(f)? We give a positive answer to this question, which is based on the following regret bound:

Lemma 1 Let Ψ(FP, FN) be a linear-fractional function of the form (1), which is non-increasing in FP and FN. Assume that there exists γ > 0, such that for any classifier h:

    b_0 + b_1 FP(h) + b_2 FN(h) ≥ γ,

i.e. the denominator of Ψ is positive and bounded away from zero. Let ℓ be a λ-strongly proper composite loss function. Then, there exists a threshold θ*, such that for any real-valued function f : X → R,

    Reg_Ψ(h_{f,θ*}) ≤ C √(2/λ) √(Reg_ℓ(f)),    (2)

where C = (Ψ(h*_Ψ)(b_1 + b_2) − a_1 − a_2)/γ.

The proof is quite long and hence is postponed to Section 4. Interestingly, the proof goes through an intermediate bound of the Ψ-regret by a cost-sensitive classification regret. We split the constant in front of the bound into C and λ, because C depends only on Ψ, while λ depends only on ℓ. Table 3 lists these constants for some popular metrics. Lemma 1 has the following interpretation. If we are able to find a function f with small ℓ-regret, we are guaranteed that there exists a threshold θ* such that h_{f,θ*} has small Ψ-regret. Note that the same threshold θ* works for any f, and the right-hand side of the bound is independent of θ*. Hence, to minimize the right-hand side we only need to minimize the ℓ-regret, and we can deal with the threshold afterwards.
Lemma 1 also reveals the form of the optimal classifier h*_Ψ: take f = f* in the lemma and note that Reg_ℓ(f*) = 0, so that Reg_Ψ(h_{f*,θ*}) = 0, which means that h_{f*,θ*} is the maximizer of Ψ:

    h*_Ψ(x) = sgn(f*(x) − θ*) = sgn(η(x) − ψ⁻¹(θ*)),

where the second equality is due to f* = ψ(η) and the strict monotonicity of ψ. Hence, h*_Ψ is a threshold function on η. The proof of Lemma 1 (see Section 4) actually specifies the exact value of the threshold θ*:

    ψ⁻¹(θ*) = (Ψ(h*_Ψ) b_1 − a_1) / (Ψ(h*_Ψ)(b_1 + b_2) − a_1 − a_2),    (3)

which is in agreement with the result obtained by Natarajan et al. (2014). 5 To make Lemma 1 easier to grasp, consider the special case when Ψ = 1 − (FP + FN) is the classification accuracy. In this case, (3) gives ψ⁻¹(θ*) = 1/2. Hence, we recover the well-known result that the classifier maximizing the accuracy is a threshold function on η at 1/2. Then, Lemma 1 states that given a real-valued f, we should take the classifier h_{f,θ*} which thresholds f at θ* = ψ(1/2) (one can verify that θ* = 0 for the logistic, squared-error and exponential losses). The bounds from the lemma are in this case identical (up to a multiplicative constant) to the bounds of Bartlett et al. (2006).
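The accuracy special case is easy to verify directly: each link function from Table 2 maps the probability threshold 1/2 to the score threshold θ* = ψ(1/2) = 0. A quick check:

```python
import math

# Link functions psi(eta_hat) of the three losses from Table 2.
links = {
    "squared-error": lambda p: 2 * p - 1,
    "logistic":      lambda p: math.log(p / (1 - p)),
    "exponential":   lambda p: 0.5 * math.log(p / (1 - p)),
}

# For classification accuracy the optimal probability threshold is 1/2,
# so the threshold on the scale of f is theta* = psi(1/2) = 0.
for name, psi in links.items():
    assert abs(psi(0.5)) < 1e-12
```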
Unfortunately, in general the optimal threshold θ* is unknown, as (3) contains the unknown quantity Ψ(h*_Ψ). The solution in this case is, given f, to directly search for a threshold which maximizes Ψ(h_{f,θ}). This is the main result of the paper:

Theorem 2 Given a real-valued function f, let θ̂ = arg max_θ Ψ(h_{f,θ}). Then, under the assumptions and notation of Lemma 1:

    Reg_Ψ(h_{f,θ̂}) ≤ C √(2/λ) √(Reg_ℓ(f)).

Proof The result follows immediately from Lemma 1: since θ̂ maximizes Ψ(h_{f,θ}) over θ,

    Reg_Ψ(h_{f,θ̂}) = Ψ(h*_Ψ) − Ψ(h_{f,θ̂}) ≤ Ψ(h*_Ψ) − Ψ(h_{f,θ*}) = Reg_Ψ(h_{f,θ*}).

Theorem 2 motivates the following procedure for maximization of Ψ:
1. Find f with small ℓ-regret, e.g. by using a learning algorithm minimizing ℓ-risk on the training sample.
2. Given f, tune the threshold θ̂ by maximizing Ψ(h_{f,θ}) on a separate validation sample.
5. Natarajan et al. (2014) required some continuity assumptions to prove (3). Our analysis shows that these assumptions are not necessary.
Theorem 2 states that the Ψ-regret of the classifier obtained by this procedure is upperbounded by the ℓ-regret of the underlying real-valued function. We now briefly discuss how to approach step 2 of the procedure in practice. In principle, this step requires maximizing Ψ defined through FP and FN, which are expectations over the unknown distribution Pr(x, y). However, as long as Ψ does not change too rapidly (e.g. Ψ has bounded derivatives), it is sufficient to optimize θ on the empirical counterpart of Ψ calculated on a separate validation sample.
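In practice, step 2 reduces to a one-dimensional search: sort the validation scores f(x) and evaluate the empirical Ψ at thresholds between consecutive scores. A minimal sketch for the empirical F-measure (function names are ours; an O(n²) loop for clarity, though an O(n log n) sweep is possible):

```python
import numpy as np

def empirical_f1(fp, fn, pi):
    # Empirical F1 through FP/FN rates; pi is the empirical
    # positive-class frequency, so TP = pi - fn.
    return 2 * (pi - fn) / (2 * pi + fp - fn)

def tune_threshold(scores, y):
    # Maximize the empirical F1 of h(x) = sgn(f(x) - theta) over theta,
    # trying midpoints between consecutive sorted validation scores.
    s = np.sort(scores)
    pi = np.mean(y == 1)
    cands = np.concatenate(([s[0] - 1.0], (s[:-1] + s[1:]) / 2, [s[-1] + 1.0]))
    best_theta, best_val = cands[0], -np.inf
    for theta in cands:
        pred = np.where(scores > theta, 1, -1)
        fp = np.mean((pred == 1) & (y == -1))
        fn = np.mean((pred == -1) & (y == 1))
        val = empirical_f1(fp, fn, pi)
        if val > best_val:
            best_theta, best_val = theta, val
    return best_theta, best_val

# On a separable validation sample the tuned threshold attains F1 = 1.
scores = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([-1, -1, 1, 1])
theta_hat, val = tune_threshold(scores, y)
assert abs(val - 1.0) < 1e-12
```

The same sweep works for any metric expressible through the empirical FP and FN rates, e.g. the AM measure.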
Step 2 involves optimization within a class of threshold functions (since f is fixed), which has VC dimension equal to 2 (Devroye et al., 1996). If Ψ has bounded derivatives, there exist constants G_1, G_2 such that:

    |Ψ(FP, FN) − Ψ(FP′, FN′)| ≤ G_1 |FP − FP′| + G_2 |FN − FN′|,    (4)

so that uniform convergence of the empirical FP and FN rates over this class translates into uniform convergence of the empirical counterpart of Ψ, and the threshold tuned on the validation sample is close to optimal with high probability.

Proof of Lemma 1
The proof can be skipped without affecting the flow of later sections. The proof consists of two steps. First, we bound the Ψ-regret of any classifier h by its cost-sensitive classification regret (introduced below). Next, we show that there exists a threshold θ*, such that for any f, the cost-sensitive classification regret of h_{f,θ*} is upperbounded by the ℓ-regret of f.

Bounding Ψ-regret by cost-sensitive regret. Given a real number α ∈ [0, 1], define the cost-sensitive classification loss ℓ_α : {−1, 1} × {−1, 1} → R_+ as:

    ℓ_α(y, ŷ) = α ⟦y = −1⟧ ⟦ŷ = 1⟧ + (1 − α) ⟦y = 1⟧ ⟦ŷ = −1⟧.
The cost-sensitive loss assigns different misclassification costs to positive and negative labels. Given a classifier h, the cost-sensitive risk of h is:

    Risk_α(h) = E[ℓ_α(y, h(x))] = α FP(h) + (1 − α) FN(h),

and the cost-sensitive regret is:

    Reg_α(h) = Risk_α(h) − Risk_α(h*_α),

where h*_α = arg min_h Risk_α(h). We now show that there exists α* such that for any h,

    Reg_Ψ(h) ≤ C Reg_{α*}(h),    (5)

where C is defined as in the content of Lemma 1. For the sake of clarity, we use the shorthand notation Ψ = Ψ(h), Ψ* = Ψ(h*_Ψ), FP = FP(h), FN = FN(h), A = a_0 + a_1 FP + a_2 FN, B = b_0 + b_1 FP + b_2 FN for the numerator and denominator of Ψ(h), and analogously FP*, FN*, A* and B* for Ψ(h*_Ψ). In this notation, using Ψ* B* = A*:

    Reg_Ψ(h) = Ψ* − Ψ = (Ψ* B − A)/B
             = (1/B) ((Ψ* b_1 − a_1)(FP − FP*) + (Ψ* b_2 − a_2)(FN − FN*))
             ≤ (1/γ) ((Ψ* b_1 − a_1)(FP − FP*) + (Ψ* b_2 − a_2)(FN − FN*)),    (6)

where the last inequality follows from B ≥ γ (assumption) and the fact that Reg_Ψ(h) ≥ 0 for any h. Since Ψ is non-increasing in FP and FN, we have

    ∂Ψ*/∂FP* = (a_1 − b_1 Ψ*)/B* ≤ 0,

and similarly ∂Ψ*/∂FN* = (a_2 − b_2 Ψ*)/B* ≤ 0. This and the assumption B* ≥ γ imply that both Ψ* b_1 − a_1 and Ψ* b_2 − a_2 are non-negative, so they can be interpreted as misclassification costs. If we normalize the costs by defining:

    α* = (Ψ* b_1 − a_1) / (Ψ*(b_1 + b_2) − a_1 − a_2),    C = (Ψ*(b_1 + b_2) − a_1 − a_2)/γ,    (7)

then (6) implies:

    Reg_Ψ(h) ≤ C (α*(FP − FP*) + (1 − α*)(FN − FN*)) = C (Risk_{α*}(h) − Risk_{α*}(h*_Ψ)) ≤ C Reg_{α*}(h),

where the last inequality holds because Risk_{α*}(h*_Ψ) ≥ Risk_{α*}(h*_{α*}). This proves (5).

Bounding cost-sensitive regret by ℓ-regret. We will show that there exists a threshold θ* such that for any f:

    Reg_{α*}(h_{f,θ*}) ≤ √(2/λ) √(Reg_ℓ(f)).    (8)

This, along with (5), implies Lemma 1. First, we will show that (8) holds conditionally for every x. To this end, we fix x and deal with h(x) ∈ {−1, 1}, f(x) ∈ R and η(x) ∈ [0, 1], using the shorthand notation h, f, η.
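A key property used in this step is that the Bayes-optimal classifier for the cost-sensitive loss ℓ_α predicts +1 exactly when η(x) > α, i.e. it thresholds the conditional probability at α. A small numerical check of this fact (a sketch, not the paper's code):

```python
import numpy as np

def cs_conditional_risk(pred, eta, alpha):
    # Expected alpha-cost-sensitive loss of a hard prediction at a
    # point with Pr(y = 1 | x) = eta: a false positive costs alpha,
    # a false negative costs (1 - alpha).
    return alpha * (1 - eta) if pred == 1 else (1 - alpha) * eta

# The rule "predict +1 iff eta > alpha" is never beaten by the
# opposite prediction, for any cost alpha and any eta.
for alpha in np.linspace(0.05, 0.95, 19):
    for eta in np.linspace(0.0, 1.0, 101):
        bayes = 1 if eta > alpha else -1
        assert (cs_conditional_risk(bayes, eta, alpha)
                <= cs_conditional_risk(-bayes, eta, alpha) + 1e-12)
```

Combined with the fact that minimizing a proper composite loss recovers ψ(η), this explains why a single f can be thresholded (at θ* = ψ(α*)) to handle any misclassification costs.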

Empirical results
We perform experiments on synthetic and benchmark data to empirically study the two-step procedure analyzed in the previous sections. As the surrogate loss, we use logistic loss, which is 4-strongly proper composite, but we also verify how hinge loss, an example of a loss function which is not strongly proper composite (not even proper composite), behaves in this procedure. As our task performance metrics, we take the F-measure (F_β-measure with β = 1) and the AM measure. We could also use the Jaccard similarity coefficient; it turns out, however, that the threshold optimized for the F-measure coincides with the optimal threshold for the Jaccard similarity coefficient, so the latter measure does not give anything substantially different from the F-measure. The purpose of this study is not to compare the two-step approach with alternative methods; this has already been done in previous work on the subject, see, e.g., (Nan et al., 2012; Parambath et al., 2014). We also note that similar experiments can be found in the cited papers on the statistical consistency of generalized performance metrics (Natarajan et al., 2014; Narasimhan et al., 2014; Parambath et al., 2014). In this study, we unavoidably repeat some of the results obtained therein, but we include them to make our paper self-contained and to emphasize the difference between proper composite losses and non-proper losses.

Synthetic data
We performed two experiments on synthetic data. The first experiment deals with a discrete domain, in which we learn within the class of all possible classifiers. The second experiment concerns a continuous domain, in which we learn within a restricted class of linear functions.
First experiment. We let the input domain X be a finite set, X = {1, 2, . . . , 25}, and take Pr(x) to be uniform over X. For each x ∈ X, we randomly draw a value of η(x) from the uniform distribution on the interval [0, 1]. In the first step, we take an algorithm which minimizes a given surrogate loss within the class of all functions. Hence, given training data of size n, the algorithm computes the empirical minimizer of the loss independently for each x. As surrogate losses, we use logistic and hinge loss. In the second step, we tune the threshold θ̂ on a separate validation set, also of size n. For each n, we repeat the procedure 100,000 times, averaging over samples and over models (different random choices of η(x)). We start with n = 100 and increase the number of training examples up to n = 10,000. The ℓ-regret and Ψ-regret can easily be computed, as the distribution is known and X is discrete.
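This discrete setup also makes it easy to verify by brute force the claim from Section 3 that h*_Ψ thresholds η: on a small discrete domain one can enumerate all classifiers and check that the F-measure-optimal one predicts +1 exactly on the points with the largest η(x). A sketch (ours, not the paper's code; 8 points instead of 25 so that enumerating all 2⁸ classifiers is feasible, seed arbitrary):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
eta = rng.uniform(size=8)          # eta(x) on a small discrete domain
px = np.full(8, 1 / 8)             # uniform Pr(x)
pi = np.sum(px * eta)              # Pr(y = 1)

def f1(h):
    # h: vector of +/-1 predictions over the domain; population F1
    # computed from the exact FP/FN rates.
    fp = np.sum(px * (1 - eta) * (h == 1))
    fn = np.sum(px * eta * (h == -1))
    return 2 * (pi - fn) / (2 * pi + fp - fn)

best = np.array(max(itertools.product([-1, 1], repeat=8),
                    key=lambda h: f1(np.array(h))))
# The optimum predicts +1 exactly on {x : eta(x) > t} for some t,
# i.e. the optimal classifier is a threshold function on eta.
pos, neg = eta[best == 1], eta[best == -1]
assert pos.size == 0 or neg.size == 0 or neg.max() < pos.min()
```

The same enumeration argument works for any metric that is non-increasing in FP and FN: swapping a positive prediction on a low-η point for one on a higher-η point decreases both FP and FN, so the optimum must be a threshold function on η.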
The results are given in Fig. 1. The ℓ-regret goes down to zero for both surrogate losses, which is expected, since this is the objective function minimized by the algorithm. Minimization of logistic loss (top plot) gives vanishing Ψ-regret for both the F-measure and the AM measure, as predicted by Theorem 2. In contrast, minimization of hinge loss (bottom plot) is suboptimal for both task metrics and gives non-zero Ψ-regret even in the limit n → ∞. This behavior can easily be explained by the fact that hinge loss is not a proper (composite) loss: the risk minimizer for hinge loss is given by f*(x) = sgn(η(x) − 1/2) (Bartlett et al., 2006). Hence, the hinge loss minimizer is already a threshold function on η(x), with the threshold value set to 1/2. If, for a given performance metric Ψ, the optimal threshold θ* is different from 1/2, the hinge loss minimizer will necessarily have suboptimal Ψ-risk. This is clearly visible for the F-measure. The better result on the AM measure is explained by the fact that the average optimal threshold over all models is 0.5 for this measure, so the minimizer of hinge loss is not far from the maximizer of the AM measure.
Second experiment. We take X = R² and generate x ∈ X from a standard Gaussian distribution. We use a logistic model of the form η(x) = 1/(1 + exp(−a_0 − aᵀx)). The weights a = (a_1, a_2) and a_0 are also drawn from a standard Gaussian. For a given model (set of weights), we take training sets of increasing size, from n = 100 up to n = 3000, using 20 different sets for each n. We also generate one test set of size 100,000. For each n, we use 2/3 of the training data to learn a linear model f(x) = w_0 + wᵀx, using either support vector machines (SVM, with linear kernel) or logistic regression (LR). We use the implementations of these algorithms from the LibLinear package (Fan et al., 2008). 6 The remaining 1/3 of the training data is used for tuning the threshold. We average the results over 20 different models.
The results are given in Fig. 2. The results obtained for LR (the logistic loss minimizer) agree with our theoretical analysis: the ℓ-regret and the Ψ-regret with respect to both the F-measure and the AM measure go to zero. This is expected, as the data-generating model is a linear logistic model, and thus coincides with the class of functions over which we optimize. The situation is different for SVM (the hinge loss minimizer). Firstly, the ℓ-regret for hinge loss does not converge to zero. This is because the risk minimizer for hinge loss is a threshold function sgn(η(x) − 1/2), and it is not possible to approximate such a function with a linear model f(x) = w_0 + wᵀx. Hence, even when n → ∞, the empirical hinge loss minimizer (SVM) does not converge to the risk minimizer. This behavior, however, is advantageous for the performance of SVM in terms of the F-measure. This is because the risk minimizer for hinge loss, a threshold function on η(x) with threshold value 1/2, performs poorly in terms of the F-measure, whose maximizer has a threshold θ* very different from 1/2. The linear model constraint prevents convergence to the risk minimizer, and the resulting linear function f(x) = w_0 + wᵀx will be close to some reversible function of η(x); hence, after tuning the threshold, we end up close to the maximizer of the F-measure. This is seen in the top panel of Fig. 2, where the F-measure regret of SVM is quite close to zero (but still worse than for LR). The performance of SVM in terms of the AM measure is much worse and looks quite unstable. After inspecting the data, we found that for some models with imbalanced class priors, SVM reduces the weights w to zero and sets the intercept w_0 to 1 or −1, predicting the same value for all x ∈ X (this is not caused by a software problem; it is how the empirical loss minimizer behaves). This phenomenon negatively affects the performance in terms of the AM measure (and, to a much lesser extent, in terms of the F-measure).

Table 4: Basic statistics for the benchmark datasets.

    dataset          #examples    #features
    covtype.binary   581,012      54
    gisette          7,000        5,000

Benchmark data
We also performed a similar experiment on two binary benchmark datasets, 7 described in Table 4. We randomly take out a test set of size 181,012 for covtype, and of size 3,000 for gisette. We use the remaining examples for training. As before, we incrementally increase the size of the training set. We use 2/3 of the training examples for learning a linear model with SVM or LR, and the rest for tuning the threshold. We repeat the experiment (random train/validation/test split) 20 times. The results are plotted in Fig. 3. Since the data distribution is unknown, we are unable to compute the risk minimizers; hence we plot the average loss/metric on the test set rather than the regret. The results show that SVM performs slightly better on the covtype dataset, while LR performs better on the gisette dataset. However, there is very little difference between the performance of SVM and LR in terms of the F-measure and the AM measure on these data sets. We suspect this is because the function η(x) is very far from linear for these problems, so that neither LR nor SVM converges to the ℓ-risk minimizer, and Theorem 2 does not apply. Further study would be required to understand the behavior of surrogate losses in this case.

Summary
We presented a theoretical analysis of a two-step approach to optimizing generalized classification performance metrics, which first learns a real-valued function f on a training sample by minimizing a surrogate loss, and then tunes a threshold on f by optimizing the target performance metric on a separate validation sample. We showed that if the metric is a linear-fractional function, and the surrogate loss is strongly proper composite, then the regret of the resulting classifier (obtained by thresholding the real-valued f) measured with respect to the target metric is upperbounded by the regret of f measured with respect to the surrogate loss.
7. Datasets taken from the LibSVM repository: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets

A natural question is whether our results can be generalized to other classification performance metrics, not necessarily of the linear-fractional form (1). In general, the answer is negative: it is known (Narasimhan et al., 2014) that for some performance metrics, h*_Ψ is not necessarily a threshold function on η, and a bound of the form given by Lemma 1 cannot hold: plugging in f = ψ(η) makes the right-hand side of the bound zero, while the left-hand side remains nonzero for any value of the threshold. On the other hand, Narasimhan et al. (2014) showed that h*_Ψ becomes a threshold function under some mild continuity assumptions on the distribution of η(x). It remains an open question whether a regret bound is possible in this case.