Abstract
We consider optimization of generalized performance metrics for binary classification by means of surrogate losses. We focus on a class of metrics, which are linear-fractional functions of the false positive and false negative rates (examples of which include the \(F_{\beta }\)-measure, Jaccard similarity coefficient, AM measure, and many others). Our analysis concerns the following two-step procedure. First, a real-valued function f is learned by minimizing a surrogate loss for binary classification on the training sample. It is assumed that the surrogate loss is a strongly proper composite loss function (examples of which include logistic loss, squared-error loss, exponential loss, etc.). Then, given f, a threshold \(\widehat{\theta }\) is tuned on a separate validation sample, by direct optimization of the target performance metric. We show that the regret of the resulting classifier (obtained from thresholding f on \(\widehat{\theta }\)) measured with respect to the target metric is upper-bounded by the regret of f measured with respect to the surrogate loss. We also extend our results to cover multi-label classification and provide regret bounds for micro- and macro-averaging measures. Our findings are further analyzed in a computational study on both synthetic and real data sets.
Introduction
In binary classification, misclassification error is not necessarily an adequate evaluation metric, and one often resorts to more complex metrics, better suited for the problem. For instance, when classes are imbalanced, the \(F_{\beta }\)-measure (Lewis 1995; Jansche 2005; Nan et al. 2012) and the AM measure (balanced error rate) (Menon et al. 2013) are frequently used. Optimizing such generalized performance metrics poses computational and statistical challenges, as they cannot be decomposed into losses on individual observations.
In this paper, we consider optimization of generalized performance metrics by means of surrogate losses. We restrict our attention to a family of performance metrics which are ratios of linear functions of false positives (FP) and false negatives (FN). Such functions are called linear-fractional, and include the aforementioned \(F_{\beta }\) and AM measures, as well as the Jaccard similarity coefficient, weighted accuracy, and many others (Koyejo et al. 2014, 2015). We focus on the most popular approach to optimizing generalized performance metrics in practice, based on the following two-step procedure. First, a real-valued function f is learned by minimizing a surrogate loss for binary classification on the training sample. Then, given f, a threshold \(\widehat{\theta }\) is tuned on a separate validation sample, by direct optimization of the target performance measure with respect to the classifier obtained by thresholding f at \(\widehat{\theta }\): all observations with a value of f above the threshold are classified as positive, and all observations below the threshold as negative. This approach can be motivated by an asymptotic analysis: minimization of an appropriate surrogate loss results in estimation of the conditional (“posterior”) class probabilities, and many performance metrics are maximized by a classifier which predicts by thresholding on the scale of conditional probabilities (Nan et al. 2012; Zhao et al. 2013; Koyejo et al. 2014). However, it is unclear what can be said about the behavior of this procedure on finite samples.
In this work, we are interested in a theoretical analysis and justification of this approach for any sample size, and for any, not necessarily perfect, classification function. To this end, we use the notion of regret with respect to some evaluation metric, which is the difference between the performance of a given classifier and the performance of the optimal classifier with respect to this metric. We show that the regret of the resulting classifier (obtained by thresholding f at \(\widehat{\theta }\)) measured with respect to the target metric is upper-bounded by the regret of f measured with respect to the surrogate loss. Our result holds for any surrogate loss function which is strongly proper composite (Agarwal 2014), examples of which include logistic loss, squared-error loss, exponential loss, etc. Interestingly, the proof of our result goes by an intermediate bound of the regret with respect to the target measure by a cost-sensitive classification regret. As a by-product, we get a bound on the cost-sensitive classification regret by the surrogate regret of a real-valued function which holds simultaneously for all misclassification costs: the misclassification costs only influence the threshold, but not the function, the surrogate loss, or the regret bound.
We further extend our results to cover multi-label classification, in which the goal is to simultaneously predict multiple labels for each object. We consider two methods of generalizing binary classification performance metrics to the multi-label setting: macro-averaging and micro-averaging (Manning et al. 2008; Parambath et al. 2014; Koyejo et al. 2015). Macro-averaging is based on first computing the performance metric separately for each label, and then averaging the metrics over the labels. In micro-averaging, the false positives and false negatives for each label are first averaged over the labels, and then the performance metric is calculated on these averaged quantities. We show that our regret bounds hold for both macro- and micro-averaging measures. Interestingly, for micro-averaging, only a single threshold needs to be tuned and is shared among all labels.
Our findings are further analyzed in a computational study on both synthetic and real data sets. We compare the performance of the algorithm when used with two types of surrogate losses: the logistic loss (which is strongly proper) and the hinge loss (which is not a proper loss). On synthetic data sets, we analyze the behavior of the algorithm for a discrete feature distribution (where non-parametric classifiers are used), and for a continuous feature distribution (where linear classifiers are used). Next, we look at the performance of the algorithm on real-life benchmark data sets, both for binary and multi-label classification.
We note that the goal of this paper is not to propose a new learning algorithm, but rather to provide a deeper statistical understanding of an existing method. The two-step procedure analyzed here (also known as the plug-in method in the case when the outcomes of the function have a probabilistic interpretation) is commonly used in binary classification under generalized performance metrics, but this is exactly the reason why we think it is important to study this method in more depth from a theoretical point of view.
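To make the two-step procedure concrete, the following minimal sketch implements it end to end with plain NumPy. The synthetic data generator, the gradient-descent optimizer, and the choice of the F1-measure as the target metric are our own illustrative assumptions, not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, class-imbalanced data: an illustrative setup of our own,
# not the paper's experimental design.
n, d = 2000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
eta = 1.0 / (1.0 + np.exp(-(X @ w_true - 1.5)))   # Pr(y=1|x), positives rare
y = np.where(rng.uniform(size=n) < eta, 1, -1)
Xtr, ytr = X[:1000], y[:1000]     # training sample (step 1)
Xva, yva = X[1000:], y[1000:]     # validation sample (step 2)

# Step 1: learn a real-valued f(x) = w.x by minimizing the logistic loss
# (a strongly proper composite loss) with plain gradient descent.
w = np.zeros(d)
for _ in range(500):
    margins = ytr * (Xtr @ w)
    grad = -(Xtr * (ytr / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
    w -= 0.5 * grad

# Step 2: tune the threshold by directly maximizing the empirical
# F1-measure (our illustrative choice of target metric) on validation.
def f1_at(theta):
    pred = np.where(Xva @ w > theta, 1, -1)
    tp = np.sum((pred == 1) & (yva == 1))
    fp = np.sum((pred == 1) & (yva == -1))
    fn = np.sum((pred == -1) & (yva == 1))
    return 2.0 * tp / (2.0 * tp + fp + fn) if tp > 0 else 0.0

scores = Xva @ w
candidates = np.concatenate(([scores.min() - 1.0], np.sort(scores)))
best_theta = max(candidates, key=f1_at)
```

Only step 2 depends on the target metric: the same f can be reused for any linear-fractional metric by re-tuning the threshold.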
Related work
In machine learning, numerous attempts to optimize generalized performance metrics have been proposed. They can be divided into two general categories. The structured loss approaches (Musicant et al. 2003; Tsochantaridis et al. 2005; Petterson and Caetano 2011, 2010) rely on incorporating the performance metric into the training process, thus requiring specialized learning algorithms to optimize non-standard objectives. On the other hand, the plug-in approaches, which are very closely related to the topic of this work, are based on obtaining reliable class conditional probability estimates by employing standard algorithms minimizing some surrogate loss for binary classification (such as logistic loss used in logistic regression, exponential loss used in boosting, etc.), and then plugging these estimates into the functional form of the optimal prediction rule for a given performance metric (Jansche 2007; Nan et al. 2012; Dembczyński et al. 2013; Waegeman et al. 2013; Narasimhan et al. 2014, 2015; Koyejo et al. 2014, 2015).
Existing theoretical work on generalized performance metrics is mainly concerned with statistical consistency, also known as calibration, which determines whether convergence to the minimizer of a surrogate loss implies convergence to the minimizer of the task performance measure as the sample size goes to infinity (Dembczyński et al. 2010; Nan et al. 2012; Gao and Zhou 2013; Zhao et al. 2013; Narasimhan et al. 2014; Koyejo et al. 2014, 2015). Here we give a stronger result, which bounds the regret with respect to the performance metric by the regret with respect to the surrogate loss. Our result is valid for all finite sample sizes and informs about the rate of convergence.
We also note that two distinct frameworks are used to study the statistical consistency of classifiers with respect to performance metrics: Decision Theoretic Analysis (DTA), which assumes a test set of a fixed size, and Empirical Utility Maximization (EUM), in which the metric is defined by means of population quantities (Nan et al. 2012). In this context, our work falls into the EUM framework.
Parambath et al. (2014) presented an alternative approach to maximizing linear-fractional metrics by learning a sequence of binary classification problems with varying misclassification costs. While we were inspired by their theoretical analysis, their approach is more complicated than the two-step approach analyzed here, which requires solving an ordinary binary classification problem only once. Moreover, as part of our proof, we show that by minimizing a strongly proper composite loss, we are implicitly minimizing the cost-sensitive classification error for any misclassification costs without any overhead. Hence, the costs need not be known during learning, and can be determined later on a separate validation sample by optimizing the threshold. Narasimhan et al. (2015) developed a general framework for designing provably consistent algorithms for complex multi-class performance measures. They relate the regret with respect to the target metric to the conditional probability estimation error measured in terms of the \(L_1\)-metric. Their algorithms rely on accurate class conditional probability estimates and on solving multiple cost-sensitive multi-class classification problems.
The generalized performance metrics for binary classification are employed in the multi-label setting by means of one of three averaging schemes (Waegeman et al. 2013; Parambath et al. 2014; Koyejo et al. 2015): instance-averaging (averaging errors over the labels, averaging the metric over the examples), macro-averaging (averaging errors over the examples, averaging the metric over the labels), and micro-averaging (averaging errors over the examples and the labels). Koyejo et al. (2015) characterize the optimal classifiers for multi-label metrics and prove the consistency of the plug-in method. Our regret bounds for multi-label classification can be seen as a follow-up on their work.
Outline
The paper is organized as follows. In Sect. 2 we introduce basic concepts, definitions and notation. The main result is presented in Sect. 3 and proved in Sect. 4. Section 5 extends our results to the multilabel setting. The theoretical contribution of the paper is complemented by computational experiments in Sect. 6, prior to concluding with a summary in Sect. 7.
Problem setting
Binary classification
In binary classification, the goal is, given an input (feature vector) \(x \in X\), to accurately predict the output (label) \(y \in \{-1,1\}\). We assume input–output pairs (x, y) are generated i.i.d. according to \(\Pr (x,y)\). A classifier is a mapping \(h :X \rightarrow \{-1,1\}\). Given h, we define the following four quantities:
which are known as true positives, false positives, true negatives and false negatives, respectively. We also denote \(\Pr (y=1)\) by P. Note that for any h, \({\mathrm {FP}}(h) + {\mathrm {TN}}(h) = \Pr (y=-1) = 1-P\) and \({\mathrm {TP}}(h) + {\mathrm {FN}}(h) = P\), so out of the four quantities above, only two are independent. In this paper, we use the convention to parameterize all metrics by means of \({\mathrm {FP}}(h)\) and \({\mathrm {FN}}(h)\).
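On a finite sample, these four probabilities are estimated by the corresponding empirical fractions. A small helper (the function name, signature, and \(\{-1,+1\}\) label encoding are our own choices):

```python
import numpy as np

def confusion_rates(y_true, y_pred):
    """Empirical counterparts of TP, FP, TN, FN, defined, as in the text,
    as probabilities, i.e. fractions of the sample; labels in {-1, +1}."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    TP = np.mean((y_pred == 1) & (y_true == 1))
    FP = np.mean((y_pred == 1) & (y_true == -1))
    TN = np.mean((y_pred == -1) & (y_true == -1))
    FN = np.mean((y_pred == -1) & (y_true == 1))
    return TP, FP, TN, FN
```

The identities \({\mathrm {TP}} + {\mathrm {FN}} = P\) and \({\mathrm {FP}} + {\mathrm {TN}} = 1-P\) hold for the empirical estimates as well.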
We call a two-argument function \(\varPsi = \varPsi ({\mathrm {FP}},{\mathrm {FN}})\) a (generalized) classification performance metric. Given a classifier h, we define \(\varPsi (h) = \varPsi ({\mathrm {FP}}(h),{\mathrm {FN}}(h))\). Throughout the paper we assume that \(\varPsi ({\mathrm {FP}},{\mathrm {FN}})\) is linear-fractional, i.e., is a ratio of linear functions:
where we allow the coefficients \(a_i,b_i\) to depend on the distribution \(\Pr (x,y)\). Note that our convention to parameterize the metric by means of \(({\mathrm {FP}},{\mathrm {FN}})\) does not affect definition (1), because \(\varPsi \) can be reparameterized to \(({\mathrm {FP}},{\mathrm {TN}})\), \(({\mathrm {TP}},{\mathrm {FN}})\), or \(({\mathrm {TP}},{\mathrm {TN}})\), and will remain linear-fractional in all these parameterizations. We also assume \(\varPsi ({\mathrm {FP}},{\mathrm {FN}})\) is non-increasing in \({\mathrm {FP}}\) and \({\mathrm {FN}}\), a property that is inherently possessed by virtually all performance measures used in practice. Table 1 lists some popular examples of linear-fractional performance metrics.
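For instance, writing \(P = \Pr (y=1)\) and substituting \({\mathrm {TP}} = P - {\mathrm {FN}}\), some of the metrics of Table 1 take explicit linear-fractional forms in \(({\mathrm {FP}},{\mathrm {FN}})\); a sketch (function names are our own):

```python
def f_beta(FP, FN, P, beta=1.0):
    """F_beta = (1+b^2) TP / ((1+b^2) TP + b^2 FN + FP); substituting
    TP = P - FN gives a ratio of linear functions of (FP, FN)."""
    b2 = beta ** 2
    return (1.0 + b2) * (P - FN) / ((1.0 + b2) * P + FP - FN)

def jaccard(FP, FN, P):
    """Jaccard similarity: TP / (TP + FP + FN) = (P - FN) / (P + FP)."""
    return (P - FN) / (P + FP)

def accuracy(FP, FN):
    """Accuracy 1 - FP - FN is linear, hence trivially linear-fractional."""
    return 1.0 - FP - FN
```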
Let \(h^*_{\varPsi }\) be the maximizer of \(\varPsi (h)\) over all classifiers:
(if the \({\text {argmax}}\) is not unique, we take \(h^*_{\varPsi }\) to be any maximizer of \(\varPsi \)). Given any classifier h, we define its \(\varPsi \)-regret as:
The \(\varPsi \)-regret is non-negative by definition, and quantifies the suboptimality of h, i.e., how much worse h is compared to the optimal \(h^*_{\varPsi }\).
Strongly proper composite losses
Here we briefly outline the theory of strongly proper composite loss functions. See Agarwal (2014) for a more detailed description.
Define a binary class probability estimation (CPE) loss function (Reid and Williamson 2010, 2011) as a function \(c :\{-1,1\} \times [0,1] \rightarrow {\mathbb {R}}_+\), where \(c(y,\widehat{\eta })\) assigns a penalty to prediction \(\widehat{\eta }\) when the observed label is y. Define the conditional c-risk as:
the expected loss of prediction \(\widehat{\eta }\) when the label is drawn from a distribution with \(\Pr (y=1) = \eta \). We say a CPE loss is proper if for any \(\eta \in [0,1]\), \(\eta \in {\text {argmin}}_{\widehat{\eta } \in [0,1]} {\mathrm {risk}}_c(\eta ,\widehat{\eta })\). In other words, proper losses are minimized by taking the true class probability distribution as the prediction; hence \(\widehat{\eta }\) can be interpreted as a probability estimate of \(\eta \). Define the conditional c-regret as:
the difference between the conditional c-risk of \(\widehat{\eta }\) and the optimal c-risk. We say a CPE loss c is \(\lambda \)-strongly proper if for any \(\eta , \widehat{\eta }\):
i.e., the conditional c-regret is everywhere lower-bounded by a squared difference of its arguments. It can be shown (Agarwal 2014) that under mild regularity assumptions, a proper CPE loss c is \(\lambda \)-strongly proper if and only if the function \(H_{c}(\eta ) := {\mathrm {risk}}_{c}(\eta ,\eta )\) is \(\lambda \)-strongly concave. This fact lets us easily verify whether a given loss function is \(\lambda \)-strongly proper.
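As a quick numerical illustration of this criterion, consider the CPE form of the squared-error loss, \(c(y,\widehat{\eta }) = (y - (2\widehat{\eta }-1))^2\) on \(y \in \{-1,1\}\) (a standard formulation, corresponding to the link \(\psi (\widehat{\eta }) = 2\widehat{\eta }-1\)). Here \(H_c(\eta ) = 4\eta (1-\eta )\), whose second derivative is \(-8\) everywhere, so the loss is 8-strongly proper; a finite-difference check:

```python
import numpy as np

def H(eta):
    """H_c(eta) = risk_c(eta, eta) for the squared-error CPE loss
    c(y, p) = (y - (2p - 1))^2 with y in {-1, +1}."""
    s = 2.0 * eta - 1.0
    return eta * (1.0 - s) ** 2 + (1.0 - eta) * (-1.0 - s) ** 2

# lambda-strong concavity check: H''(eta) <= -lambda on (0, 1)
etas = np.linspace(0.01, 0.99, 99)
h = 1e-4
second = (H(etas + h) - 2.0 * H(etas) + H(etas - h)) / h ** 2
assert np.all(second <= -8.0 + 1e-2)   # lambda = 8 for squared error
```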
It is often more convenient to reparameterize the loss function from \(\widehat{\eta } \in [0,1]\) to a real-valued \(f \in {\mathbb {R}}\) through a strictly increasing (and therefore invertible) link function \(\psi :[0,1] \rightarrow {\mathbb {R}}\):
If c is \(\lambda \)-strongly proper, we call the function \(\ell :\{-1,1\} \times {\mathbb {R}} \rightarrow {\mathbb {R}}_+\) a \(\lambda \)-strongly proper composite loss function. The notions of conditional \(\ell \)-risk \({\mathrm {risk}}_\ell (\eta ,f)\) and conditional \(\ell \)-regret \({\mathrm {reg}}_{\ell }(\eta ,f)\) extend naturally to the case of composite losses:
and the strong properness of underlying CPE loss implies:
As an example, consider a logarithmic scoring rule:
where \(\llbracket Q \rrbracket \) is the indicator function, equal to 1 if Q holds, and to 0 otherwise. Its conditional risk is given by:
the cross-entropy between \(\eta \) and \(\widehat{\eta }\). The conditional c-regret is the binary Kullback–Leibler divergence between \(\eta \) and \(\widehat{\eta }\):
Note that since \(H(\eta ) = {\mathrm {risk}}_c(\eta ,\eta )\) is the binary entropy function, and \(\big | \frac{\text {d}^2 H}{\text {d}\eta ^2} \big | = \frac{1}{\eta (1-\eta )} \ge 4\), c is a 4-strongly proper loss. Using the logit link function \(\psi (\widehat{\eta }) = \log \frac{\widehat{\eta }}{1-\widehat{\eta }}\), we end up with the logistic loss function:
which is 4-strongly proper composite from the definition.
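The strong properness inequality for the logarithmic loss, \({\mathrm {reg}}_c(\eta ,\widehat{\eta }) \ge 2(\eta -\widehat{\eta })^2\), is thus a form of Pinsker's inequality for the binary KL divergence, and can be spot-checked numerically:

```python
import numpy as np

rng = np.random.default_rng(1)

def kl(eta, eta_hat):
    """Binary KL divergence: the conditional c-regret of the log loss."""
    return (eta * np.log(eta / eta_hat)
            + (1.0 - eta) * np.log((1.0 - eta) / (1.0 - eta_hat)))

# 4-strong properness: reg_c(eta, eta_hat) >= (4/2) * (eta - eta_hat)^2
eta, eta_hat = rng.uniform(0.01, 0.99, size=(2, 100000))
assert np.all(kl(eta, eta_hat) >= 2.0 * (eta - eta_hat) ** 2 - 1e-12)
```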
Table 2 presents some of the commonly used losses which are strongly proper composite. Note that the hinge loss \(\ell (y,f) = (1-yf)_+\), used, e.g., in support vector machines (Hastie et al. 2009), is not strongly proper composite (it is not even proper composite).
Main result
Given a real-valued function \(f :X \rightarrow {\mathbb {R}}\), and a \(\lambda \)-strongly proper composite loss \(\ell (y,f)\), define the \(\ell \)-risk of f as the expected loss of f(x) with respect to the data distribution:
where \(\eta (x) = \Pr (y=1|x)\). Let \(f_{\ell }^*\) be the minimizer of \({\mathrm {Risk}}_\ell (f)\) over all functions, \(f_{\ell }^* = {\text {argmin}}_f {\mathrm {Risk}}_\ell (f)\). Since \(\ell \) is proper composite:
Define the \(\ell \)-regret of f as:
Any real-valued function \(f :X \rightarrow {\mathbb {R}}\) can be turned into a classifier \(h_{f,\theta } :X \rightarrow \{-1,1\}\), by thresholding at some value \(\theta \):
The purpose of this paper is to address the following problem: given a function f with \(\ell \)-regret \({\mathrm {Reg}}_{\ell }(f)\), and a threshold \(\theta \), what can we say about the \(\varPsi \)-regret of \(h_{f,\theta }\)? For instance, can we bound \({\mathrm {Reg}}_{\varPsi }(h_{f,\theta })\) in terms of \({\mathrm {Reg}}_{\ell }(f)\)? We give a positive answer to this question, which is based on the following regret bound:
Lemma 1
Let \(\varPsi ({\mathrm {FP}},{\mathrm {FN}})\) be a linear-fractional function of the form (1), which is non-increasing in \({\mathrm {FP}}\) and \({\mathrm {FN}}\). Assume that there exists \(\gamma > 0\), such that for any classifier \(h :X \rightarrow \{-1,1\}\):
i.e., the denominator of \(\varPsi \) is positive and bounded away from zero. Let \(\ell \) be a \(\lambda \)-strongly proper composite loss function. Then, there exists a threshold \(\theta ^*\), such that for any real-valued function \(f :X \rightarrow {\mathbb {R}}\),
where \(C = \frac{1}{\gamma }\left( \varPsi (h_{\varPsi }^*)(b_1 + b_2) - (a_1 + a_2)\right) > 0\).
The proof is quite long and hence is postponed to Sect. 4. Interestingly, the proof goes by an intermediate bound of the \(\varPsi \)-regret by a cost-sensitive classification regret. We note that the bound in Lemma 1 is in general unimprovable, in the sense that it is easy to find f, \(\varPsi \), \(\ell \), and a distribution \(\Pr (x,y)\), for which the bound holds with equality (see the proof for details). We split the constant in front of the bound into C and \(\lambda \), because C depends only on \(\varPsi \), while \(\lambda \) depends only on \(\ell \). Table 3 lists these constants for some popular metrics. We note that the constant \(\gamma \) (the lower bound on the denominator of \(\varPsi \)) will in general be distribution-dependent (as it can depend on \(P=\Pr (y=1)\)) and may not have a uniform lower bound which holds for all distributions.
Lemma 1 has the following interpretation. If we are able to find a function f with small \(\ell \)-regret, we are guaranteed that there exists a threshold \(\theta ^*\) such that \(h_{f,\theta ^*}\) has small \(\varPsi \)-regret. Note that the same threshold \(\theta ^*\) will work for any f, and the right-hand side of the bound is independent of \(\theta ^*\). Hence, to minimize the right-hand side we only need to minimize the \(\ell \)-regret, and we can deal with the threshold afterwards.
Lemma 1 also reveals the form of the optimal classifier \(h_{\varPsi }^*\): take \(f=f^*_{\ell }\) in the lemma and note that \({\mathrm {Reg}}_{\ell }(f^*_{\ell })=0\), so that \({\mathrm {Reg}}_{\varPsi }(h_{f^*_{\ell },\theta ^*}) = 0\), which means that \(h_{f^*_{\ell },\theta ^*}\) is the maximizer of \(\varPsi \):
where the second equality is due to \(f^*_{\ell } = \psi (\eta )\) and strict monotonicity of \(\psi \). Hence, \(h^*_{\varPsi }\) is a threshold function on \(\eta \). The proof of Lemma 1 (see Sect. 4) actually specifies the exact value of the threshold \(\theta ^*\):
which is in agreement with the result obtained by Koyejo et al. (2014).
To make Lemma 1 easier to grasp, consider the special case when the performance metric \(\varPsi ({\mathrm {FP}},{\mathrm {FN}}) = 1 - {\mathrm {FP}} - {\mathrm {FN}}\) is the classification accuracy. In this case, (3) gives \(\psi ^{-1}(\theta ^*) = 1/2\). Hence, we obtain the well-known result that the classifier maximizing the accuracy is a threshold function on \(\eta \) at 1/2. Then, Lemma 1 states that given a real-valued f, we should take the classifier \(h_{f,\theta ^*}\) which thresholds f at \(\theta ^* = \psi (1/2)\). Using Table 2, one can easily verify that \(\theta ^* = 0\) for the logistic, squared-error and exponential losses. This agrees with the common approach of thresholding at 0 the real-valued classifiers trained by minimizing these losses, to obtain the label prediction. The bounds from the lemma are in this case identical (up to a multiplicative constant) to the bounds obtained by Bartlett et al. (2006).
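The link functions of these three losses, in their standard forms (as given, e.g., by Agarwal 2014), indeed all vanish at \(\widehat{\eta } = 1/2\):

```python
import numpy as np

# Standard link functions psi of three strongly proper composite losses;
# for each, theta* = psi(1/2) = 0.
links = {
    "logistic":      lambda p: np.log(p / (1.0 - p)),        # logit
    "squared-error": lambda p: 2.0 * p - 1.0,
    "exponential":   lambda p: 0.5 * np.log(p / (1.0 - p)),  # half-logit
}

for name, psi in links.items():
    assert abs(psi(0.5)) < 1e-12, name
```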
Unfortunately, for more complicated performance metrics, the optimal threshold \(\theta ^*\) is unknown, as (3) contains the unknown quantity \(\varPsi (h^*_{\varPsi })\), the value of the metric at the optimum. The solution in this case is, given f, to directly search for a threshold which maximizes \(\varPsi (h_{f,\theta })\). This is the main result of the paper:
Theorem 1
Given a realvalued function f, let \(\theta ^*_f = {\text {argmax}}_{\theta } \varPsi (h_{f,\theta })\). Then, under the assumptions and notation from Lemma 1:
Proof
The result follows immediately from Lemma 1: Solving \(\max _{\theta }\varPsi (h_{f,\theta })\) is equivalent to solving \(\min _\theta {\mathrm {Reg}}_{\varPsi }(h_{f,\theta })\), and \(\min _\theta {\mathrm {Reg}}_{\varPsi }(h_{f,\theta }) \le {\mathrm {Reg}}_{\varPsi }(h_{f,\theta ^*})\), where \(\theta ^*\) is the threshold given by Lemma 1. \(\square \)
Theorem 1 motivates the following procedure for maximization of \(\varPsi \):

1. Find f with small \(\ell \)-regret, e.g., by using a learning algorithm minimizing the \(\ell \)-risk on the training sample.

2. Given f, solve \(\theta ^*_f = {\text {argmax}}_{\theta } \varPsi (h_{f,\theta })\).
Theorem 1 states that the \(\varPsi \)-regret of the classifier obtained by this procedure is upper-bounded by the \(\ell \)-regret of the underlying real-valued function.
We now discuss how to approach step 2 of the procedure in practice. In principle, this step requires maximizing \(\varPsi \) defined through \({\mathrm {FP}}\) and \({\mathrm {FN}}\), which are expectations over an unknown distribution \(\Pr (x,y)\). However, it is sufficient to optimize \(\theta \) on the empirical counterpart of \(\varPsi \) calculated on a separate validation sample. Let \(\mathcal {T} = \{(x_i,y_i)\}_{i=1}^n\) be the validation set of size n. Define:
the empirical counterparts of \({\mathrm {FP}}\) and \({\mathrm {FN}}\), and let \(\widehat{\varPsi }(h) = \varPsi (\widehat{{\mathrm {FP}}}(h),\widehat{{\mathrm {FN}}}(h))\) be the empirical counterpart of the performance metric \(\varPsi \). We now replace step 2 by:
Given f and validation sample \(\mathcal {T}\), solve \(\widehat{\theta }_f = {\text {argmax}}_{\theta } \widehat{\varPsi }(h_{f,\theta })\).
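In practice, this is a cheap one-dimensional search: on a validation sample of size n, the classifier \(h_{f,\theta }\) changes only when \(\theta \) crosses one of the n scores \(f(x_i)\), so at most \(n+1\) candidate thresholds need to be evaluated. A sketch (function name and signature are our own; `Psi_hat` is any empirical metric written as a function of the empirical \({\mathrm {FP}}\), \({\mathrm {FN}}\), and P):

```python
import numpy as np

def tune_threshold(scores, y, Psi_hat):
    """Maximize the empirical metric over thresholds; labels in {-1, +1}.
    Since predictions only change when theta crosses a score, checking
    the n sorted scores plus one value below them all is sufficient."""
    scores, y = np.asarray(scores, dtype=float), np.asarray(y)
    P = np.mean(y == 1)
    candidates = np.concatenate(([scores.min() - 1.0], np.sort(scores)))
    best_theta, best_val = None, -np.inf
    for theta in candidates:
        pred = np.where(scores > theta, 1, -1)
        FP = np.mean((pred == 1) & (y == -1))
        FN = np.mean((pred == -1) & (y == 1))
        val = Psi_hat(FP, FN, P)
        if val > best_val:
            best_theta, best_val = theta, val
    return best_theta, best_val
```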
In Theorem 2 below, we show that:
so that tuning the threshold on the validation sample of size n (which results in \(\widehat{\theta }_f\)) instead of on the population level (which results in \(\theta ^*_f\)) will cost at most \(O\Big (\sqrt{\frac{\log n}{n}}\Big )\) additional regret. The main idea of the proof is that finding the optimal threshold comes down to optimizing within a class of \(\{-1,1\}\)-valued threshold functions, which has small Vapnik–Chervonenkis dimension. This, together with the fact that under assumptions from Lemma 1, \(\varPsi \) is stable with respect to its arguments, implies that \(\varPsi (h_{f,\widehat{\theta }_f})\) is close to \(\varPsi (h_{f,\theta ^*_f})\).
Theorem 2
Let the assumptions from Lemma 1 hold, and let:
and \(D = \max \{D_1,D_2\}\). Given a real-valued function f, and a validation set \(\mathcal {T}\) of size n generated i.i.d. from \(\Pr (x,y)\), let \(\widehat{\theta }_f = {\text {argmax}}_{\theta } \widehat{\varPsi } (h_{f,\theta })\) be the threshold maximizing the empirical counterpart of \(\varPsi \) evaluated on \(\mathcal {T}\). Then, with probability \(1-\delta \) (over the random choice of \(\mathcal {T}\)):
Proof
For any \({\mathrm {FP}}\) and \({\mathrm {FN}}\), we have:
and similarly,
For any \(({\mathrm {FP}},{\mathrm {FN}})\) and \(({\mathrm {FP}}',{\mathrm {FN}}')\), Taylorexpanding \(\varPsi ({\mathrm {FP}},{\mathrm {FN}})\) around \(({\mathrm {FP}}',{\mathrm {FN}}')\) up to the first order and using the bounds above gives:
Now, we have:
where we used Theorem 1. Thus, it remains to bound \(\varPsi (h_{f,\theta ^*_f}) - \varPsi (h_{f,\widehat{\theta }_f})\). From the definition of \(\widehat{\theta }_f\), \(\widehat{\varPsi }(h_{f,\widehat{\theta }_f}) \ge \widehat{\varPsi }(h_{f,\theta ^*_f})\), hence:
where we used the definition of \(\widehat{\varPsi }\). Using (4),
Note that the suprema above are over the deviation of an empirical mean from its expectation over the class of threshold functions, which has Vapnik–Chervonenkis dimension equal to 2. Using a standard argument from Vapnik–Chervonenkis theory (see, e.g., Devroye et al. 1996), with probability \(1 - \frac{\delta }{2}\) over the random choice of \(\mathcal {T}\):
and similarly for the second supremum. Thus, with probability \(1-\delta \),
which finishes the proof. \(\square \)
We note that, in contrast to a similar result by Koyejo et al. (2014), Theorem 2 does not require continuity of the cumulative distribution function of \(\eta (x)\) around \(\theta ^*\).
Proof of Lemma 1
The proof can be skipped without affecting the flow of later sections. The proof consists of two steps. First, we bound the \(\varPsi \)-regret of any classifier h by its cost-sensitive classification regret (introduced below). Next, we show that there exists a threshold \(\theta ^*\), such that for any f, the cost-sensitive classification regret of \(h_{f,\theta ^*}\) is upper-bounded by the \(\ell \)-regret of f. These two steps will be formalized as Propositions 1 and 2.
Given a real number \(\alpha \in [0,1]\), define a cost-sensitive classification loss \(\ell _\alpha :\{-1,1\} \times \{-1,1\} \rightarrow {\mathbb {R}}_+\) as:
The cost-sensitive loss assigns different costs of misclassification for positive and negative labels. Given a classifier h, the cost-sensitive risk of h is:
and the cost-sensitive regret is:
where \(h^*_{\alpha } = {\text {argmin}}_h {\mathrm {Risk}}_{\alpha }(h)\). We now show the following two results:
Proposition 1
Let \(\varPsi \) satisfy the assumptions from Lemma 1. Define:
Then, \(\alpha \in [0,1]\) and for any classifier h,
where C is defined as in the statement of Lemma 1.
Proof
The proof generalizes the proof of Proposition 6 from Parambath et al. (2014), which concerned the special case of the \(F_{\beta }\)-measure. For the sake of clarity, we use the shorthand notation \(\varPsi = \varPsi (h)\), \(\varPsi ^* = \varPsi (h^*_{\varPsi })\), \({\mathrm {FP}}= {\mathrm {FP}}(h)\), \({\mathrm {FN}}= {\mathrm {FN}}(h)\), \(A = a_0 + a_1 {\mathrm {FP}}+ a_2 {\mathrm {FN}}\), \(B = b_0 + b_1 {\mathrm {FP}}+ b_2 {\mathrm {FN}}\) for the numerator and denominator of \(\varPsi (h)\), and analogously \({\mathrm {FP}}^*\), \({\mathrm {FN}}^*\), \(A^*\) and \(B^*\) for \(\varPsi (h^*_{\varPsi })\). In this notation:
where the last inequality follows from \(B \ge \gamma \) (assumption) and the fact that \({\mathrm {Reg}}_{\varPsi }(h) \ge 0\) for any h. Since \(\varPsi \) is non-increasing in \({\mathrm {FP}}\) and \({\mathrm {FN}}\), we have
and similarly \(\frac{\partial \varPsi ^*}{\partial {\mathrm {FN}}^*} = \frac{a_2 - b_2 \varPsi ^*}{B^*} \le 0\). This and the assumption \(B^* \ge \gamma \) imply that both \(\varPsi ^* b_1 - a_1\) and \(\varPsi ^* b_2 - a_2\) are non-negative, so they can be interpreted as misclassification costs. If we normalize the costs by defining:
then (6) implies:
\(\square \)
Proposition 2
For any real-valued function \(f :X \rightarrow {\mathbb {R}}\), any \(\lambda \)-strongly proper composite loss \(\ell \) with link function \(\psi \), and any \(\alpha \in [0,1]\):
where \(\theta ^* = \psi (\alpha )\).
Proof
First, we will show that (7) holds conditionally for every x. To this end, we fix x and deal with \(h(x) \in \{-1,1\}\), \(f(x) \in {\mathbb {R}}\) and \(\eta (x) \in [0,1]\), using the shorthand notation \(h,f,\eta \).
Given \(\eta \in [0,1]\) and \(h \in \{-1,1\}\), define the conditional cost-sensitive risk as:
Let \(h_{\alpha }^* = {\text {argmin}}_{h} {\mathrm {risk}}_{\alpha }(\eta ,h)\). It can be easily verified that:
Define the conditional cost-sensitive regret as
Note that if \(h = h_{\alpha }^*\), then \({\mathrm {reg}}_\alpha (\eta ,h) = 0\). Otherwise, \({\mathrm {reg}}_\alpha (\eta ,h) = |\eta - \alpha |\), so that:
Now assume \(h = {\mathrm {sgn}}(\widehat{\eta } - \alpha )\) for some \(\widehat{\eta }\), i.e., h is of the same form as \(h_{\alpha }^*\) in (8), with \(\eta \) replaced by \(\widehat{\eta }\). We show that for such h,
This statement trivially holds when \(h = h_{\alpha }^*\). If \(h \ne h_{\alpha }^*\), then \(\eta \) and \(\widehat{\eta }\) are on opposite sides of \(\alpha \) (i.e., either \(\eta \ge \alpha \) and \(\widehat{\eta } < \alpha \), or \(\eta < \alpha \) and \(\widehat{\eta } \ge \alpha \)), hence \(|\eta - \alpha | \le |\eta - \widehat{\eta }|\), which proves (9).
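These conditional quantities are simple enough to tabulate directly. The sketch below uses one standard cost normalization (a false positive costs \(\alpha \), a false negative costs \(1-\alpha \), so the costs sum to one), under which the optimal classifier thresholds \(\eta \) at \(\alpha \) as in (8); the displayed definition of \(\ell _\alpha \) in the text is the authoritative form:

```python
def ell_alpha(y, y_hat, alpha):
    """Cost-sensitive loss, costs normalized to sum to one: a false
    positive costs alpha, a false negative costs 1 - alpha."""
    if y == 1 and y_hat == -1:
        return 1.0 - alpha
    if y == -1 and y_hat == 1:
        return alpha
    return 0.0

def cond_reg_alpha(eta, h, alpha):
    """Conditional cost-sensitive regret: zero when h agrees with
    sgn(eta - alpha), and |eta - alpha| otherwise."""
    def risk(hh):
        return eta * ell_alpha(1, hh, alpha) + (1.0 - eta) * ell_alpha(-1, hh, alpha)
    return risk(h) - min(risk(1), risk(-1))
```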
Now, we set the threshold to \(\theta ^* = \psi (\alpha )\), so that given \(f \in {\mathbb {R}}\),
due to the strict monotonicity of \(\psi \). Using (9) with \(h = h_{f,\theta ^*}\) and \(\widehat{\eta } = \psi ^{-1}(f)\) gives:
and the last inequality follows from strong properness (2).
To prove the unconditional statement (7), we take expectation with respect to x on both sides of (10):
where the second inequality is from Jensen’s inequality applied to the concave function \(x \mapsto \sqrt{x}\).
We note that derivation of (9) follows the steps of the proof of Lemma 4 in Menon et al. (2013), while (10) and (11) were shown in the proof of Theorem 13 by Agarwal (2014). Hence, the proof is essentially a combination of existing results, which are rederived here for the sake of completeness. \(\square \)
Proof
(of Lemma 1). Lemma 1 immediately follows from Propositions 1 and 2. \(\square \)
Note that the proof actually specifies the exact value of the universal threshold, \(\theta ^* = \psi (\alpha )\), where \(\alpha \) is given by (5).
The bound in Lemma 1 is unimprovable in the sense that there exist f, \(\varPsi \), \(\ell \), and a distribution \(\Pr (x,y)\) for which the bound is tight. To see this, take, for instance, the squared-error loss \(\ell (y,f) = (y-f)^2\) and the classification accuracy metric \(\varPsi ({\mathrm {FP}},{\mathrm {FN}}) = 1-{\mathrm {FP}}-{\mathrm {FN}}\). The constants in Lemma 1 are equal to \(\gamma = 1\), \(C = 2\), and \(\lambda = 8\) (see Table 1), while the optimal threshold is \(\theta ^* = 0\). The bound then simplifies to
which is known to be tight (Bartlett et al. 2006).
Multi-label classification
In multi-label classification (Dembczyński et al. 2012; Parambath et al. 2014; Koyejo et al. 2015), the goal is, given an input (feature vector) \(x \in X\), to simultaneously predict the subset \(L \subseteq \mathcal {L}\) of the set of m labels \(\mathcal {L} = \{\sigma _1,\ldots ,\sigma _m\}\). The subset L is often called the set of relevant (positive) labels, while the complement \(\mathcal {L} {\setminus } L\) is considered as irrelevant (negative) for x. We identify a set L of relevant labels with a vector \({\varvec{y}}= (y_1, y_2, \ldots , y_m)\), \(y_i \in \{-1,1\}\), in which \(y_i = 1\) iff \(\sigma _i \in L\). We assume observations \((x,{\varvec{y}})\) are generated i.i.d. according to \(\Pr (x,{\varvec{y}})\) (note that the labels are not assumed to be independent). A multi-label classifier:
is a mapping \({\varvec{h}}:X \rightarrow \{-1,1\}^m\), which assigns a (predicted) label subset to each instance \(x \in X\). For any \(i=1,\ldots ,m\), the function \(h_i(x)\) is thus a binary classifier, which can be evaluated by means of \({\mathrm {TP}}_i(h_i)\), \({\mathrm {FP}}_i(h_i)\), \({\mathrm {TN}}_i(h_i)\) and \({\mathrm {FN}}_i(h_i)\), the true/false positives/negatives defined with respect to label \(y_i\), e.g., \({\mathrm {FP}}_i(h_i) = \Pr (h_i(x) = 1 \wedge y_i = -1)\).
Let \(f_1,\ldots ,f_m\) be a set of real-valued functions \(f_i :X \rightarrow {\mathbb {R}}\), \(i=1,\ldots ,m\), and let \(\ell \) be a \(\lambda \)-strongly proper composite loss for binary classification. For each \(i=1,\ldots ,m\), we let \({\mathrm {Risk}}^i_{\ell }(f_i)\) and \({\mathrm {Reg}}^i_{\ell }(f_i)\) denote the \(\ell \)-risk and the \(\ell \)-regret of function \(f_i\) with respect to label \(y_i\):
Note that the problem has been decomposed into m independent binary problems and the functions can be obtained by training m independent realvalued binary classifiers by minimizing loss \(\ell \) on the training sample, one for each out of m labels.
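A minimal sketch of this per-label decomposition, assuming a toy one-dimensional data set and a simple gradient-descent minimizer of logistic loss (the data-generating process and all helper names below are ours, for illustration only):

```python
import math
import random

def train_logistic(xs, ys, steps=500, lr=0.1):
    # Minimize empirical logistic loss sum_i log(1 + exp(-y_i * (w*x_i + b)))
    # by plain gradient descent (1-d features for simplicity).
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(y * (w * x + b)))  # = sigmoid(-y*f(x))
            gw -= y * x * p / n
            gb -= y * p / n
        w -= lr * gw
        b -= lr * gb
    return lambda x: w * x + b  # real-valued scoring function f_i

# Toy multilabel sample: each label y_i follows its own hypothetical logistic
# model, and each f_i is trained on its own label, independently of the rest.
random.seed(0)
xs = [random.uniform(-2, 2) for _ in range(200)]
Y = [[1 if random.random() < 1 / (1 + math.exp(-(i + 1) * x)) else -1
      for x in xs] for i in range(3)]
fs = [train_logistic(xs, Y[i]) for i in range(3)]  # m = 3 independent problems
```

Each `fs[i]` is trained only on the i-th column of labels, mirroring the decomposition into m independent binary problems.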
What follows next depends on how the binary classification performance metric is applied in the multilabel setting. We consider two ways of turning a binary classification metric into a multilabel metric: macro-averaging and micro-averaging (Manning et al. 2008; Parambath et al. 2014; Koyejo et al. 2015).
Macro-averaging
Given a binary classification performance metric \(\varPsi (h) = \varPsi ({\mathrm {FP}}(h),{\mathrm {FN}}(h))\), and a multilabel classifier \({\varvec{h}}\), we define the macro-averaged metric \(\varPsi _{\mathrm {macro}}({\varvec{h}})\) as:
Macro-averaging is thus based on first computing the performance metric separately for each label, and then averaging the metrics over the labels. The \(\varPsi _{\mathrm {macro}}\)-regret is then defined as:
where \({\varvec{h}}_{\varPsi }^* = (h^*_{\varPsi ,1},\ldots ,h^*_{\varPsi ,m})\) is the \(\varPsi \)-optimal multilabel classifier:
Since the regret decomposes into a weighted sum, it is straightforward to apply the previously derived bound to obtain a regret bound for the macro-averaged performance metric.
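To make the macro-averaging definition concrete, here is a small sketch with the F-measure playing the role of \(\varPsi \) (the helper functions and toy confusion counts are illustrative, not from the paper):

```python
def f1(fp, fn, tp):
    # F-measure as one concrete choice of Psi: 2*TP / (2*TP + FP + FN).
    d = 2 * tp + fp + fn
    return 2 * tp / d if d else 0.0

def macro_average(per_label_counts, metric=f1):
    # Macro-averaging: compute Psi separately per label, then average.
    vals = [metric(fp, fn, tp) for (fp, fn, tp) in per_label_counts]
    return sum(vals) / len(vals)

counts = [(10, 5, 85), (2, 3, 15), (0, 1, 4)]  # toy (FP, FN, TP) per label
macro_f = macro_average(counts)
```

Note that labels with few positive examples contribute to the average with the same weight as frequent labels, which is characteristic of macro-averaging.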
Theorem 3
Let \(\varPsi ({\mathrm {FP}},{\mathrm {FN}})\) and \(\ell \) satisfy the assumptions of Lemma 1. For a set of m real-valued functions \(\{ f_i :X \rightarrow {\mathbb {R}} \}_{i=1}^m\), let \(\theta ^*_{f_i} = {\text {argmax}}_{\theta } \varPsi (h_{f_i,\theta })\) for each \(i=1,\ldots ,m\). Then the classifier \({\varvec{h}}\) defined as:
achieves the following bound on its \(\varPsi _{\mathrm {macro}}\)-regret:
where \(C_i = \frac{1}{\gamma }\left( \varPsi (h_{\varPsi ,i}^*)(b_1 + b_2) - (a_1 + a_2)\right) \), \(i=1,\ldots ,m\).
Proof
The theorem follows from applying Theorem 1 once for each label, and then averaging the bounds over the labels. \(\square \)
Theorem 3 suggests a straightforward decomposition into m independent binary classification problems, one for each label \(y_1,\ldots ,y_m\), and running (independently for each problem) the two-step procedure described in Sect. 3: for \(i=1,\ldots ,m\), we learn a function \(f_i\) with small \(\ell \)-regret with respect to label \(y_i\), and tune the threshold \(\theta ^*_{f_i}\) to optimize \(\varPsi (h_{f_i,\theta })\) (as in the binary classification case, one can show that tuning the threshold on a separate validation sample is sufficient). Due to the decomposition of \(\varPsi _{\mathrm {macro}}\) into a sum over the labels, this simple procedure turns out to be sufficient. As we shall see, the case of micro-averaging is more interesting.
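The per-label threshold tuning step of this procedure can be sketched as follows (a toy illustration; the candidate-midpoint scheme and the numbers are our own choices, not prescribed by the theory):

```python
def f1(fp, fn, tp):
    # F-measure as a function of confusion counts: 2*TP / (2*TP + FP + FN).
    d = 2 * tp + fp + fn
    return 2 * tp / d if d else 0.0

def tune_threshold(scores, labels, metric=f1):
    # Candidate thresholds: midpoints between consecutive sorted scores,
    # plus one threshold below all scores (predict everything positive).
    uniq = sorted(set(scores))
    cands = [uniq[0] - 1.0] + [(a + b) / 2 for a, b in zip(uniq, uniq[1:])]

    def value(th):
        tp = sum(1 for s, y in zip(scores, labels) if s > th and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s > th and y == -1)
        fn = sum(1 for s, y in zip(scores, labels) if s <= th and y == 1)
        return metric(fp, fn, tp)

    return max(cands, key=value)

# Macro procedure: one threshold per label, each tuned on its own
# validation scores (toy numbers below, m = 2 labels).
val_scores = [[0.9, 0.2, 0.6, 0.1], [0.8, 0.7, 0.3, 0.4]]
val_labels = [[1, -1, 1, -1], [1, -1, -1, 1]]
thresholds = [tune_threshold(s, y) for s, y in zip(val_scores, val_labels)]
```

Scanning only midpoints between sorted validation scores is enough, since the empirical metric is piecewise constant in the threshold.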
Micro-averaging
Given a binary classification performance metric \(\varPsi (h) = \varPsi ({\mathrm {FP}}(h),{\mathrm {FN}}(h))\), and a multilabel classifier \({\varvec{h}}\), we define the micro-averaged metric \(\varPsi _{\mathrm {micro}}({\varvec{h}})\) as:
where
Thus, in micro-averaging, the false positives and false negatives are first averaged over the labels, and then the performance metric is computed on these averaged quantities. The \(\varPsi _{\mathrm {micro}}\)-regret:
does not decompose into a sum over labels anymore. However, we are still able to obtain a regret bound by reusing the techniques from Sect. 4, and, interestingly, this time only a single threshold, shared among all labels, needs to be tuned.^{Footnote 3}
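For concreteness, micro-averaging can be sketched as follows, again with the F-measure standing in for \(\varPsi \) (toy counts; helper names are ours):

```python
def f1(fp, fn, tp):
    # F-measure: 2*TP / (2*TP + FP + FN).
    d = 2 * tp + fp + fn
    return 2 * tp / d if d else 0.0

def micro_average(per_label_counts, metric=f1):
    # Micro-averaging: average the FP/FN (and TP) quantities over labels
    # first, then apply the metric once to the averaged quantities.
    m = len(per_label_counts)
    fp = sum(c[0] for c in per_label_counts) / m
    fn = sum(c[1] for c in per_label_counts) / m
    tp = sum(c[2] for c in per_label_counts) / m
    return metric(fp, fn, tp)

counts = [(10, 5, 85), (2, 3, 15), (0, 1, 4)]  # toy (FP, FN, TP) per label
micro_f = micro_average(counts)
```

On the same counts, micro- and macro-averaging generally disagree: micro-averaging weights errors by their pooled frequency, so frequent labels dominate.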
Theorem 4
Let \(\varPsi ({\mathrm {FP}},{\mathrm {FN}})\) and \(\ell \) satisfy the assumptions of Lemma 1. For a set of m real-valued functions \(\{ f_i :X \rightarrow {\mathbb {R}} \}_{i=1}^m\), let \(\theta ^*_f = {\text {argmax}}_{\theta } \varPsi _{\mathrm {micro}}({\varvec{h}}_{f,\theta })\), where
Then, the classifier \({\varvec{h}}_{f,\theta ^*_f} = (h_{f_1,\theta ^*_f}, \ldots , h_{f_m,\theta ^*_f})\) achieves the following bound on its \(\varPsi _{\mathrm {micro}}\)-regret:
where \(C = \frac{1}{\gamma }\left( \varPsi _{\mathrm {micro}}({\varvec{h}}_{\varPsi }^*)(b_1 + b_2) - (a_1 + a_2)\right) \).
Proof
The proof follows closely the proof of Lemma 1. In fact, only Proposition 1 requires modifications, which are given below. Take any real values \({\mathrm {FP}},{\mathrm {FN}}\) and \({\mathrm {FP}}^*,{\mathrm {FN}}^*\) (to be specified later) in the domain of \(\varPsi \), such that:
Using exactly the same steps as in the derivation of (6), we obtain:
where
Now, we take: \({\mathrm {FP}}^* = \overline{{\mathrm {FP}}}({\varvec{h}}_{\varPsi }^*), {\mathrm {FN}}^* = \overline{{\mathrm {FN}}}({\varvec{h}}_{\varPsi }^*)\), \({\mathrm {FP}}= \overline{{\mathrm {FP}}}({\varvec{h}})\) and \({\mathrm {FN}}= \overline{{\mathrm {FN}}}({\varvec{h}})\) for some \({\varvec{h}}\). Hence, (12) is clearly satisfied as its left-hand side is just the \(\varPsi _{\mathrm {micro}}\)-regret, \({\mathrm {Reg}}_{\varPsi _{\mathrm {micro}}}({\varvec{h}})\). This means that for any multilabel classifier \({\varvec{h}}\):
where \({\mathrm {Risk}}^i_{\alpha }(h_i)\) and \({\mathrm {Reg}}^i_{\alpha }(h_i)\) are the cost-sensitive risk and the cost-sensitive regret defined with respect to label \(y_i\):
If we now take \(h_i = h_{f_i,\theta ^*}\), where \(\theta ^* = \psi (\alpha )\), \(\psi \) being the link function of the loss, Proposition 2 (applied for each \(i=1,\ldots ,m\) separately) implies:
Together, this gives:
The theorem now follows by noticing that:
and thus \({\mathrm {Reg}}_{\varPsi _{\mathrm {micro}}}({\varvec{h}}_{f,\theta ^{*}_{f}}) \le {\mathrm {Reg}}_{\varPsi _{\mathrm {micro}}}({\varvec{h}}_{f,\theta ^{*}})\). \(\square \)
Theorem 4 suggests a decomposition into m independent binary classification problems, one for each label \(y_1,\ldots ,y_m\), and training m real-valued functions \(f_1,\ldots ,f_m\), each with small \(\ell \)-regret on the corresponding label. Then, however, contrary to macro-averaging, a single threshold, shared among all labels, is tuned by optimizing \(\varPsi _{\mathrm {micro}}\) on a separate validation sample.
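The shared-threshold tuning suggested by Theorem 4 can be sketched as follows (toy scores and labels; the pooled-midpoint candidate set is our own illustrative choice):

```python
def tune_shared_threshold(scores_per_label, labels_per_label):
    # Pool scores from all labels; candidates are midpoints between
    # consecutive pooled scores plus one value below all of them.
    pooled = sorted({s for scores in scores_per_label for s in scores})
    cands = [pooled[0] - 1.0] + [(a + b) / 2 for a, b in zip(pooled, pooled[1:])]

    def micro_f1(th):
        # Micro-F1 of thresholding every label's scores at the same th.
        tp = fp = fn = 0
        for scores, labels in zip(scores_per_label, labels_per_label):
            for s, y in zip(scores, labels):
                if s > th:
                    tp += (y == 1)
                    fp += (y == -1)
                else:
                    fn += (y == 1)
        d = 2 * tp + fp + fn
        return 2 * tp / d if d else 0.0

    return max(cands, key=micro_f1)

# Toy validation scores for m = 2 labels (illustrative numbers).
scores = [[0.9, 0.2, 0.6, 0.1], [0.8, 0.7, 0.3, 0.4]]
labels = [[1, -1, 1, -1], [1, -1, -1, 1]]
theta = tune_shared_threshold(scores, labels)
```

In contrast to the macro-averaged case, the maximization runs over one scalar threshold applied to all labels jointly.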
Empirical results
We perform experiments on synthetic and benchmark data to empirically study the two-step procedure analyzed in the previous sections. To this end, we minimize a surrogate loss in the first step to obtain a real-valued function f, and in the second step, we tune a threshold \(\hat{\theta }\) on a separate validation set to optimize a given performance metric. We use logistic loss as the surrogate loss in this procedure. Recall that logistic loss is 4-strongly proper composite (see Table 2). We compare its performance with hinge loss, which is not even a proper composite loss.
As our task performance metrics, we take the F-measure (\(F_{\beta }\)-measure with \(\beta =1\)) and the AM measure (a special case of weighted accuracy with weights \(w_1 = P\) and \(w_2 = 1-P\)). We could also use the Jaccard similarity coefficient; it turns out, however, that the threshold optimized for the F-measure coincides with the optimal threshold for the Jaccard similarity coefficient (because the Jaccard similarity coefficient is strictly monotonic in the F-measure and vice versa), so the latter measure does not give anything substantially different from the F-measure.
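Both task metrics can be written as functions of the unconditional error rates \({\mathrm {FP}}\), \({\mathrm {FN}}\) and the class prior \(P = \Pr (y=1)\), using \({\mathrm {TP}} = P - {\mathrm {FN}}\). A small sketch (function names are ours):

```python
def am_measure(fp, fn, pos_prior):
    # AM: arithmetic mean of true positive rate and true negative rate,
    # with fp = Pr(h=1, y=-1), fn = Pr(h=-1, y=1), pos_prior = Pr(y=1).
    tpr = (pos_prior - fn) / pos_prior
    tnr = (1.0 - pos_prior - fp) / (1.0 - pos_prior)
    return 0.5 * (tpr + tnr)

def f_beta(fp, fn, pos_prior, beta=1.0):
    # F_beta via unconditional rates, using TP = pos_prior - FN.
    tp = pos_prior - fn
    b2 = beta * beta
    d = (1 + b2) * tp + fp + b2 * fn
    return (1 + b2) * tp / d if d else 0.0
```

For instance, a classifier that always predicts the negative class has \({\mathrm {FP}}=0\) and \({\mathrm {FN}}=P\), giving AM exactly 1/2 and F-measure 0.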
The experiments on benchmark data are split into two parts. The first part concerns binary classification problems, while the second part concerns multilabel classification.
The purpose of this study is not to compare the two-step approach with alternative methods; this has already been done in previous work on the subject, see, e.g., Nan et al. (2012) and Parambath et al. (2014). We also note that similar experiments have been performed in the cited papers on the statistical consistency of generalized performance metrics (Koyejo et al. 2014; Narasimhan et al. 2014; Parambath et al. 2014; Koyejo et al. 2015). We therefore unavoidably repeat some of the results obtained therein, but the main novelty of the experiments reported here is that we emphasize the difference between strongly proper composite losses and non-proper losses.
Synthetic data
We performed two experiments on synthetic data. The first experiment deals with a discrete domain, in which we learn within the class of all possible classifiers. The second experiment concerns a continuous domain, in which we learn within a restricted class of linear functions.
First experiment We let the input domain X be a finite set consisting of 25 elements, \(X=\{1,2,\ldots ,25\}\), and take \(\Pr (x)\) to be uniform over X, i.e., \(\Pr (x=i)=1/25\). For each \(x \in X\), we randomly draw a value of \(\eta (x)\) from the uniform distribution on the interval [0, 1]. In the first step, we take an algorithm which minimizes a given surrogate loss \(\ell \) within the class of all functions \(f :X \rightarrow {\mathbb {R}}\). Hence, given the training data of size n, the algorithm computes the empirical minimizer of the surrogate loss \(\ell \) independently for each x. As surrogate losses, we use logistic and hinge loss. In the second step, we tune the threshold \(\hat{\theta }\) on a separate validation set, also of size n. For each n, we repeat the procedure 100,000 times, averaging over samples and over models (different random choices of \(\eta (x)\)). We start with \(n=100\) and increase the number of training examples up to \(n=10{,}000\). The \(\ell \)-regret and \(\varPsi \)-regret can be computed exactly, as the distribution is known and X is discrete.
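A compact simulation of this first experiment's estimation step, using the standard fact that, per x, the empirical logistic-loss minimizer over all functions is the logit of the empirical positive frequency at x (the smoothing and clipping below are our own additions to keep the logit finite):

```python
import math
import random

random.seed(1)
X = list(range(25))
eta = {x: random.random() for x in X}  # random model: eta(x) ~ U[0, 1]

def sample(n):
    # Draw n i.i.d. pairs (x, y) with x uniform on X and y ~ Bernoulli(eta(x)).
    xs = [random.choice(X) for _ in range(n)]
    ys = [1 if random.random() < eta[x] else -1 for x in xs]
    return xs, ys

def fit_per_x(xs, ys, eps=1e-3):
    # Per-x empirical minimizer of logistic loss: the logit of the
    # (smoothed, clipped) empirical positive frequency at x.
    f = {}
    for x in X:
        labs = [y for xi, y in zip(xs, ys) if xi == x]
        pos = sum(1 for y in labs if y == 1)
        p = (pos + 0.5) / (len(labs) + 1.0)  # smoothing: our own addition
        p = min(max(p, eps), 1.0 - eps)
        f[x] = math.log(p / (1.0 - p))
    return f

f_hat = fit_per_x(*sample(10000))
```

Since \(\eta \) and \(\Pr (x)\) are known here, the exact \(\ell \)-regret and \(\varPsi \)-regret of `f_hat` could then be computed by summing over the 25 points.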
The results are given in Fig. 1. The \(\ell \)-regret goes down to zero for both surrogate losses, which is expected, since this is the objective function minimized by the algorithm. Minimization of logistic loss (left plot) gives vanishing \(\varPsi \)-regret for both the F-measure and the AM measure, as predicted by Theorem 1. In contrast, minimization of hinge loss (right plot) is suboptimal for both task metrics and gives nonzero \(\varPsi \)-regret even in the limit \(n \rightarrow \infty \). This behavior is easily explained by the fact that hinge loss is not a proper (composite) loss: the risk minimizer for hinge loss is \(f^*_{\ell }(x) = {\mathrm {sgn}}(\eta (x)-1/2)\) (Bartlett et al. 2006). Hence, the hinge loss minimizer is already a threshold function on \(\eta (x)\), with the threshold value set to 1/2. If, for a given performance metric \(\varPsi \), the optimal threshold \(\theta ^*\) is different from 1/2, the hinge loss minimizer will necessarily have suboptimal \(\varPsi \)-risk. This is clearly visible for the F-measure. The better result on the AM measure is explained by the fact that the average optimal threshold over all models is 0.5 for this measure, so the minimizer of hinge loss is not that far from the minimizer of the AM measure.
Second experiment We take \(X = {\mathbb {R}}^2\) and generate \(x \in X\) from a standard Gaussian distribution. We use a logistic model of the form \(\eta (x) = \frac{1}{1 + \exp (-a_0 - a^\top x)}\). The weights \(a = (a_1, a_2)\) and \(a_0\) are also drawn from a standard Gaussian. For a given model (set of weights), we take training sets of increasing size, from \(n=100\) up to \(n=3000\), using 20 different sets for each n. We also generate one test set of size 100,000. For each n, we use 2/3 of the training data to learn a linear model \(f(x) = w_0 + w^\top x\), using either support vector machines (SVM, with linear kernel) or logistic regression (LR). We use the implementations of these algorithms from the LibLinear package (Fan et al. 2008).^{Footnote 4} The remaining 1/3 of the training data is used for tuning the threshold. We average the results over 20 different models.
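The data-generating process of this experiment can be sketched as follows (a pure-Python stand-in; in the actual experiment the subsequent fitting is done with LibLinear, for which scikit-learn's `LogisticRegression` with `solver='liblinear'` would be one possible interface):

```python
import math
import random

random.seed(2)
a0 = random.gauss(0, 1)
a = (random.gauss(0, 1), random.gauss(0, 1))

def eta(x):
    # Logistic model: eta(x) = 1 / (1 + exp(-a0 - a^T x)).
    return 1.0 / (1.0 + math.exp(-a0 - a[0] * x[0] - a[1] * x[1]))

def draw(n):
    # Draw n i.i.d. examples: x ~ standard Gaussian, y ~ Bernoulli(eta(x)).
    data = []
    for _ in range(n):
        x = (random.gauss(0, 1), random.gauss(0, 1))
        y = 1 if random.random() < eta(x) else -1
        data.append((x, y))
    return data

train = draw(3000)
fit_part, val_part = train[:2000], train[2000:]  # 2/3 fit, 1/3 threshold tuning
```

Because the generating model is itself linear-logistic, the logistic-loss risk minimizer lies in the linear class, which is what makes LR consistent in this experiment.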
The results are given in Fig. 2. As before, we plot the average \(\ell \)-regret for logistic and hinge loss, and the \(\varPsi \)-regret for the F-measure and the AM measure. The results obtained for LR (logistic loss minimizer) agree with our theoretical analysis: the \(\ell \)-regret and the \(\varPsi \)-regret with respect to both the F-measure and the AM measure go to zero. This is expected, as the data generating model is a linear logistic model (so that the risk minimizer for logistic loss is a linear function), and thus coincides with the class of functions over which we optimize. The situation is different for SVM (hinge loss minimizer). Firstly, the \(\ell \)-regret for hinge loss does not converge to zero. This is because the risk minimizer for hinge loss is the threshold function \({\mathrm {sgn}}(\eta (x) - 1/2)\), and it is not possible to approximate such a function with a linear model \(f(x) = w_0 + w^\top x\). Hence, even when \(n \rightarrow \infty \), the empirical hinge loss minimizer (SVM) does not converge to the risk minimizer. This behavior, however, can be advantageous for SVM in terms of the task performance measures. The risk minimizer for hinge loss, a threshold function on \(\eta (x)\) with threshold value 1/2, performs poorly, for example, in terms of the F-measure and the AM measure, for which the optimal threshold \(\theta ^*\) is usually very different from 1/2. In turn, the linear model constraint prevents convergence to the risk minimizer, and the resulting linear function \(f(x) = w_0 + w^\top x\) will often be close to some invertible function of \(\eta (x)\); hence, after tuning the threshold, we will often end up close to the minimizer of a given task performance measure. This is seen for the F-measure in the left panel of Fig. 2. In this case, the F-regret of SVM gets quite close to zero, but is still worse than that of LR.
The non-vanishing regret is mainly caused by the fact that for some models with imbalanced class priors, SVM reduces the weights w to zero and sets the intercept \(w_0\) to 1 or \(-1\), predicting the same value for all \(x \in X\) (this is not caused by a software problem; it is how the empirical loss minimizer behaves). Interestingly, the F-measure is only slightly affected by this pathological behavior of the empirical hinge loss minimizer. In turn, the AM measure, for which the plots are drawn in the right panel of Fig. 2, is not robust against this behavior of SVM: predicting the majority class results in an AM measure of exactly 1/2, a very poor performance, on the same level as a random classifier.
Benchmark data for binary classification
The next experiment is performed on two binary benchmark datasets,^{Footnote 5} described in Table 4. We randomly hold out a test set of size 181,012 for covtype, and of size 3000 for gisette. We use the remaining examples for training. As before, we incrementally increase the size of the training set. We use 2/3 of the training examples for learning a linear model with SVM or LR, and the rest for tuning the threshold. We repeat the experiment (random train/validation/test split) 20 times. The results are plotted in Fig. 3. Since the data distribution is unknown, we are unable to compute the risk minimizers; hence, we plot the average loss/metric on the test set rather than the regret. The results show that SVM performs better on the covtype dataset, while LR performs better on the gisette dataset. However, there is very little difference between the performance of SVM and LR in terms of the F-measure and the AM measure on these data sets. We suspect this is due to the fact that the function \(\eta (x)\) is very different from linear for these problems, so that neither LR nor SVM converges to the \(\ell \)-risk minimizer, and Theorem 1 does not apply. Further studies would be required to understand the behavior of surrogate losses in this case.
Benchmark data for multilabel classification
In the last experiment, we use three multilabel benchmark data sets.^{Footnote 6} Table 5 provides a summary of basic statistics of these datasets. The aim of the experiment is to verify the theoretical results from Sect. 5 on learning with micro- and macro-averaged performance metrics. We use the F-measure and the AM measure as in the previous experiments.
The data sets are already split into training and testing parts. As before, we train a linear model using either SVM or LR on 2/3 of the training examples. The rest of the training data is used for tuning the threshold. For optimizing macro-averaged measures, we tune the threshold separately for each label. This approach agrees with our analysis in Sect. 5.1. For micro-averaging, we tune a common threshold for all labels: we simply collect predictions for all labels and find the best threshold over these pooled values. This approach is justified by the theoretical analysis in Sect. 5.2. Hence, the only difference between the micro- and macro-versions of the algorithms is whether a single threshold or multiple thresholds are tuned. In total, we use 8 algorithm variants: two learning algorithms (LR/SVM), two performance measures (F/AM), and two types of averaging (macro/micro). Note that our experiments include evaluating algorithms tuned for macro-averaging in terms of micro-averaged metrics, and vice versa. The goal of such cross-analysis is to determine the impact of threshold sharing for both averaging schemes. As before, we incrementally increase the size of the training set, and repeat training and threshold tuning 20 times (we use random draws of training instances into the proper training and validation parts; the test set is always the same, as originally specified for each data set). The results are given in Fig. 4.
The plots generally agree with the conclusions of the theoretical analysis, with some intriguing exceptions, however. As expected, LR tuned for a given performance metric obtains the best result with respect to that metric in most cases. For the scene data set, however, the methods tuned for micro-averaged metrics (a single threshold shared among labels) outperform the ones tuned for macro-averaged metrics (separate thresholds tuned for each label), even when evaluated in terms of macro-averaged metrics. A similar result was obtained by Koyejo et al. (2015). It seems that tuning a single threshold shared among all labels can lead to a more stable solution that is less prone to overfitting, even though it is not the optimal thing to do for macro-averaged measures. We further note that, interestingly, SVM outperforms LR in terms of macro-F on mediamill, the only case in which SVM obtains a better result than LR.
Summary
We present a theoretical analysis of a two-step approach to optimizing classification performance metrics, which first learns a real-valued function f on a training sample by minimizing a surrogate loss, and then tunes the threshold on f by optimizing the target performance metric on a separate validation sample. We show that if the metric is a linear-fractional function, and the surrogate loss is strongly proper composite, then the regret of the resulting classifier (obtained from thresholding the real-valued f) measured with respect to the target metric is upper-bounded by the regret of f measured with respect to the surrogate loss. The proof of our result goes by an intermediate bound of the regret with respect to the target measure by a cost-sensitive classification regret. As a by-product, we get a bound on the cost-sensitive classification regret by a surrogate regret of a real-valued function which holds simultaneously for all misclassification costs. We also extend our results to cover multilabel classification and provide regret bounds for micro- and macro-averaging measures. Our findings are backed up by a computational study on both synthetic and real data sets.
Notes
Throughout the paper, we follow the convention that all conditional quantities are lowercase (regret, risk), while all unconditional quantities are uppercase (Regret, Risk).
The fact that a single threshold is sufficient for consistency of microaveraged performance measures was already noticed by Koyejo et al. (2015).
Software available at http://www.csie.ntu.edu.tw/~cjlin/liblinear.
Datasets are taken from LibSVM repository: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets.
Datasets are taken from LibSVM repository: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets.
References
Agarwal, S. (2014). Surrogate regret bounds for bipartite ranking via strongly proper losses. Journal of Machine Learning Research, 15, 1653–1674.
Bartlett, P. L., Jordan, M. I., & McAuliffe, J. D. (2006). Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473), 138–156.
Dembczyński, K., Cheng, W., & Hüllermeier, E. (2010). Bayes optimal multilabel classification via probabilistic classifier chains. In ICML 2010 (pp. 279–286). Omnipress.
Dembczyński, K., Waegeman, W., Cheng, W., & Hüllermeier, E. (2012). On loss minimization and label dependence in multilabel classification. Machine Learning, 88, 5–45.
Dembczyński, K., Jachnik, A., Kotłowski, W., Waegeman, W., & Hüllermeier, E. (2013). Optimizing the fmeasure in multilabel classification: Plugin rule approach versus structured loss minimization. In ICML.
Devroye, L., Györfi, L., & Lugosi, G. (1996). A probabilistic theory of pattern recognition (1st ed.). Berlin: Springer.
Fan, R. E., Chang, K. W., Hsieh, C. J., Wang, X. R., & Lin, C. J. (2008). LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9, 1871–1874.
Gao, W., & Zhou, Z. H. (2013). On the consistency of multilabel learning. Artificial Intelligence, 199–200, 22–44.
Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). Elements of statistical learning: Data mining, inference, and prediction. Berlin: Springer.
Jansche, M. (2005). Maximum expected Fmeasure training of logistic regression models. In HLT/EMNLP 2005 (pp. 736–743).
Jansche, M. (2007). A maximum expected utility framework for binary sequence labeling. In ACL 2007 (pp. 736–743).
Koyejo, O., Natarajan, N., Ravikumar, PK., & Dhillon, IS. (2014). Consistent binary classification with generalized performance metrics. In Neural information processing systems (NIPS).
Koyejo, O., Natarajan, N., Ravikumar, P., & Dhillon, IS. (2015). Consistent multilabel classification. In Neural information processing systems (NIPS).
Lewis, D. (1995). Evaluating and optimizing autonomous text classification systems. In SIGIR 1995 (pp. 246–254).
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. Cambridge: Cambridge University Press.
Menon, A. K., Narasimhan, H., Agarwal, S., & Chawla, S. (2013). On the statistical consistency of algorithms for binary classification under class imbalance. In International conference on machine learning (ICML).
Musicant, D. R., Kumar, V., & Ozgur, A. (2003). Optimizing f-measure with support vector machines. In FLAIRS conference (pp. 356–360).
Nan, Y., Chai, K. M. A., Lee, WS., & Chieu, H.L. (2012). Optimizing Fmeasure: A tale of two approaches. In International conference on machine learning (ICML).
Narasimhan, H., Vaish, R., & Agarwal, S. (2014). On the statistical consistency of plugin classifiers for nondecomposable performance measures. In Neural information processing systems (NIPS).
Narasimhan, H., Ramaswamym, H. G., Saha, A., & Agarwal, S. (2015). Consistent multiclass algorithms for complex performance measures. In International conference on machine learning (ICML).
Parambath, S. P., Usunier, N., & Grandvalet, Y. (2014). Optimizing Fmeasures by costsensitive classification. In Neural information processing systems (NIPS).
Petterson, J., & Caetano, T. S. (2010). Reverse multilabel learning. Advances in Neural Information Processing Systems, 24, 1912–1920.
Petterson, J., & Caetano, T. S. (2011). Submodular multilabel learning. Advances in Neural Information Processing Systems, 24, 1512–1520.
Reid, M. D., & Williamson, R. C. (2010). Composite binary losses. Journal of Machine Learning Research, 11, 2387–2422.
Reid, M. D., & Williamson, R. C. (2011). Information, divergence and risk for binary experiments. Journal of Machine Learning Research, 12, 731–817.
Tsochantaridis, I., Joachims, T., Hofmann, T., & Altun, Y. (2005). Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6, 1453–1484.
Waegeman, W., Dembczyński, K., Jachnik, A., Cheng, W., & Hüllermeier, E. (2013). On the Bayesoptimality of Fmeasure maximizers. Journal of Machine Learning Research, 15, 3513–3568.
Zhao, M. J., Edakunni, N., Pocock, A., & Brown, G. (2013). Beyond Fano’s inequality: Bounds on the optimal Fscore, BER, and costsensitive risk and their implications. Journal of Machine Learning Research, 14, 1033–1090.
Acknowledgments
W. Kotłowski has been supported by the Polish National Science Centre under Grant No. 2013/11/D/ST6/03050. K. Dembczyński has been supported by the Polish National Science Centre under Grant No. 2013/09/D/ST6/03917.
Additional information
Editors: Geoff Holmes, TieYan Liu, Hang Li, Irwin King and ZhiHua Zhou.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Kotłowski, W., Dembczyński, K. Surrogate regret bounds for generalized classification performance metrics. Mach Learn 106, 549–572 (2017). https://doi.org/10.1007/s1099401655917
Keywords
Generalized performance metric
Regret bound
Surrogate loss function
Binary classification
Multilabel classification
F-measure
Jaccard similarity
AM measure