Abstract
We consider optimization of generalized performance metrics for binary classification by means of surrogate losses. We focus on a class of metrics, which are linear-fractional functions of the false positive and false negative rates (examples of which include the \(F_{\beta }\)-measure, Jaccard similarity coefficient, AM measure, and many others). Our analysis concerns the following two-step procedure. First, a real-valued function f is learned by minimizing a surrogate loss for binary classification on the training sample. It is assumed that the surrogate loss is a strongly proper composite loss function (examples of which include logistic loss, squared-error loss, exponential loss, etc.). Then, given f, a threshold \(\widehat{\theta }\) is tuned on a separate validation sample, by direct optimization of the target performance metric. We show that the regret of the resulting classifier (obtained from thresholding f on \(\widehat{\theta }\)) measured with respect to the target metric is upper-bounded by the regret of f measured with respect to the surrogate loss. We also extend our results to cover multi-label classification and provide regret bounds for micro- and macro-averaging measures. Our findings are further analyzed in a computational study on both synthetic and real data sets.
Introduction
In binary classification, misclassification error is not necessarily an adequate evaluation metric, and one often resorts to more complex metrics, better suited for the problem. For instance, when classes are imbalanced, the \(F_{\beta }\)-measure (Lewis 1995; Jansche 2005; Nan et al. 2012) and the AM measure (balanced error rate) (Menon et al. 2013) are frequently used. Optimizing such generalized performance metrics poses computational and statistical challenges, as they cannot be decomposed into losses on individual observations.
In this paper, we consider optimization of generalized performance metrics by means of surrogate losses. We restrict our attention to a family of performance metrics which are ratios of linear functions of false positives (FP) and false negatives (FN). Such functions are called linear-fractional, and include the aforementioned \(F_{\beta }\) and AM measures, as well as the Jaccard similarity coefficient, weighted accuracy, and many others (Koyejo et al. 2014, 2015). We focus on the most popular approach to optimizing generalized performance metrics in practice, based on the following two-step procedure. First, a real-valued function f is learned by minimizing a surrogate loss for binary classification on the training sample. Then, given f, a threshold \(\widehat{\theta }\) is tuned on a separate validation sample, by direct optimization of the target performance measure with respect to the classifier obtained by thresholding f at \(\widehat{\theta }\): all observations with a value of f above the threshold are classified as positive, and all observations below the threshold as negative. This approach can be motivated by an asymptotic analysis: minimization of an appropriate surrogate loss results in estimation of the conditional (“posterior”) class probabilities, and many performance metrics are maximized by a classifier which predicts by thresholding on the scale of conditional probabilities (Nan et al. 2012; Zhao et al. 2013; Koyejo et al. 2014). However, it is unclear what can be said about the behavior of this procedure on finite samples.
In this work, we are interested in a theoretical analysis and justification of this approach for any sample size, and for any, not necessarily perfect, classification function. To this end, we use the notion of regret with respect to some evaluation metric, which is the difference between the performance of a given classifier and the performance of the optimal classifier with respect to this metric. We show that the regret of the resulting classifier (obtained by thresholding f at \(\widehat{\theta }\)) measured with respect to the target metric is upper-bounded by the regret of f measured with respect to the surrogate loss. Our result holds for any surrogate loss function which is strongly proper composite (Agarwal 2014), examples of which include logistic loss, squared-error loss, exponential loss, etc. Interestingly, the proof of our result goes by an intermediate bound of the regret with respect to the target measure by a cost-sensitive classification regret. As a by-product, we get a bound on the cost-sensitive classification regret by the surrogate regret of a real-valued function which holds simultaneously for all misclassification costs: the misclassification costs only influence the threshold, but not the function, the surrogate loss, or the regret bound.
We further extend our results to cover multi-label classification, in which the goal is to simultaneously predict multiple labels for each object. We consider two methods of generalizing binary classification performance metrics to the multi-label setting: macro-averaging and micro-averaging (Manning et al. 2008; Parambath et al. 2014; Koyejo et al. 2015). Macro-averaging is based on first computing the performance metric separately for each label, and then averaging the metrics over the labels. In micro-averaging, the false positives and false negatives for each label are first averaged over the labels, and then the performance metric is calculated on these averaged quantities. We show that our regret bounds hold for both macro- and micro-averaging measures. Interestingly, for micro-averaging, only a single threshold needs to be tuned and is shared among all labels.
Our findings are further analyzed in a computational study on both synthetic and real data sets. We compare the performance of the algorithm when used with two types of surrogate losses: the logistic loss (which is strongly proper) and the hinge loss (which is not a proper loss). On synthetic data sets, we analyze the behavior of the algorithm for a discrete feature distribution (where non-parametric classifiers are used), and for a continuous feature distribution (where linear classifiers are used). Next, we look at the performance of the algorithm on real-life benchmark data sets, both for binary and multi-label classification.
We note that the goal of this paper is not to propose a new learning algorithm, but rather to provide a deeper statistical understanding of an existing method. The two-step procedure analyzed here (also known as the plug-in method in the case when the outcomes of the function have a probabilistic interpretation) is commonly used in binary classification under generalized performance metrics, but this is exactly the reason why we think it is important to study this method in more depth from a theoretical point of view.
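To make the two-step procedure concrete, the following minimal sketch implements it end to end with plain NumPy. The synthetic data generator, the gradient-descent optimizer, and the choice of the F1-measure as the target metric are our own illustrative assumptions, not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, class-imbalanced data: an illustrative setup of our own,
# not the paper's experimental design.
n, d = 2000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
eta = 1.0 / (1.0 + np.exp(-(X @ w_true - 1.5)))   # Pr(y=1|x), positives rare
y = np.where(rng.uniform(size=n) < eta, 1, -1)
Xtr, ytr = X[:1000], y[:1000]     # training sample (step 1)
Xva, yva = X[1000:], y[1000:]     # validation sample (step 2)

# Step 1: learn a real-valued f(x) = w.x by minimizing the logistic loss
# (a strongly proper composite loss) with plain gradient descent.
w = np.zeros(d)
for _ in range(500):
    margins = ytr * (Xtr @ w)
    grad = -(Xtr * (ytr / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
    w -= 0.5 * grad

# Step 2: tune the threshold by directly maximizing the empirical
# F1-measure (our illustrative choice of target metric) on validation.
def f1_at(theta):
    pred = np.where(Xva @ w > theta, 1, -1)
    tp = np.sum((pred == 1) & (yva == 1))
    fp = np.sum((pred == 1) & (yva == -1))
    fn = np.sum((pred == -1) & (yva == 1))
    return 2.0 * tp / (2.0 * tp + fp + fn) if tp > 0 else 0.0

scores = Xva @ w
candidates = np.concatenate(([scores.min() - 1.0], np.sort(scores)))
best_theta = max(candidates, key=f1_at)
```

Only step 2 depends on the target metric: the same f can be reused for any linear-fractional metric by re-tuning the threshold.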
Related work
In machine learning, numerous attempts to optimize generalized performance metrics have been proposed. They can be divided into two general categories. The structured loss approaches (Musicant et al. 2003; Tsochantaridis et al. 2005; Petterson and Caetano 2011, 2010) rely on incorporating the performance metric into the training process, thus requiring specialized learning algorithms to optimize non-standard objectives. On the other hand, the plug-in approaches, which are very closely related to the topic of this work, are based on obtaining reliable class conditional probability estimates by employing standard algorithms minimizing some surrogate loss for binary classification (such as logistic loss used in logistic regression, exponential loss used in boosting, etc.), and then plugging these estimates into the functional form of the optimal prediction rule for a given performance metric (Jansche 2007; Nan et al. 2012; Dembczyński et al. 2013; Waegeman et al. 2013; Narasimhan et al. 2014, 2015; Koyejo et al. 2014, 2015).
Existing theoretical work on generalized performance metrics is mainly concerned with statistical consistency, also known as calibration, which determines whether convergence to the minimizer of a surrogate loss implies convergence to the minimizer of the task performance measure as the sample size goes to infinity (Dembczyński et al. 2010; Nan et al. 2012; Gao and Zhou 2013; Zhao et al. 2013; Narasimhan et al. 2014; Koyejo et al. 2014, 2015). Here we give a stronger result, which bounds the regret with respect to the performance metric by the regret with respect to the surrogate loss. Our result is valid for all finite sample sizes and informs about the rate of convergence.
We also note that two distinct frameworks are used to study the statistical consistency of classifiers with respect to performance metrics: Decision Theoretic Analysis (DTA), which assumes a test set of a fixed size, and Empirical Utility Maximization (EUM), in which the metric is defined by means of population quantities (Nan et al. 2012). In this context, our work falls into the EUM framework.
Parambath et al. (2014) presented an alternative approach to maximizing linear-fractional metrics by learning a sequence of binary classification problems with varying misclassification costs. While we were inspired by their theoretical analysis, their approach is more complicated than the two-step approach analyzed here, which requires solving an ordinary binary classification problem only once. Moreover, as part of our proof, we show that by minimizing a strongly proper composite loss, we are implicitly minimizing the cost-sensitive classification error for any misclassification costs without any overhead. Hence, the costs need not be known during learning, and can be determined later on a separate validation sample by optimizing the threshold. Narasimhan et al. (2015) developed a general framework for designing provably consistent algorithms for complex multi-class performance measures. They relate the regret with respect to the target metric to the conditional probability estimation error measured in terms of the \(L_1\)-metric. Their algorithms rely on accurate class conditional probability estimates and on solving multiple cost-sensitive multi-class classification problems.
The generalized performance metrics for binary classification are employed in the multi-label setting by means of one of three averaging schemes (Waegeman et al. 2013; Parambath et al. 2014; Koyejo et al. 2015): instance-averaging (averaging errors over the labels, averaging the metric over the examples), macro-averaging (averaging errors over the examples, averaging the metric over the labels), and micro-averaging (averaging errors over the examples and the labels). Koyejo et al. (2015) characterize the optimal classifiers for multi-label metrics and prove the consistency of the plug-in method. Our regret bounds for multi-label classification can be seen as a follow-up on their work.
Outline
The paper is organized as follows. In Sect. 2 we introduce basic concepts, definitions and notation. The main result is presented in Sect. 3 and proved in Sect. 4. Section 5 extends our results to the multilabel setting. The theoretical contribution of the paper is complemented by computational experiments in Sect. 6, prior to concluding with a summary in Sect. 7.
Problem setting
Binary classification
In binary classification, the goal is, given an input (feature vector) \(x \in X\), to accurately predict the output (label) \(y \in \{-1,1\}\). We assume input–output pairs (x, y) are generated i.i.d. according to \(\Pr (x,y)\). A classifier is a mapping \(h :X \rightarrow \{-1,1\}\). Given h, we define the following four quantities:
which are known as true positives, false positives, true negatives and false negatives, respectively. We also denote \(\Pr (y=1)\) by P. Note that for any h, \({\mathrm {FP}}(h) + {\mathrm {TN}}(h) = \Pr (y=-1) = 1-P\) and \({\mathrm {TP}}(h) + {\mathrm {FN}}(h) = P\), so out of the four quantities above, only two are independent. In this paper, we use the convention to parameterize all metrics by means of \({\mathrm {FP}}(h)\) and \({\mathrm {FN}}(h)\).
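On a finite sample, these four probabilities are estimated by the corresponding empirical fractions. A small helper (the function name, signature, and \(\{-1,+1\}\) label encoding are our own choices):

```python
import numpy as np

def confusion_rates(y_true, y_pred):
    """Empirical counterparts of TP, FP, TN, FN, defined, as in the text,
    as probabilities, i.e. fractions of the sample; labels in {-1, +1}."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    TP = np.mean((y_pred == 1) & (y_true == 1))
    FP = np.mean((y_pred == 1) & (y_true == -1))
    TN = np.mean((y_pred == -1) & (y_true == -1))
    FN = np.mean((y_pred == -1) & (y_true == 1))
    return TP, FP, TN, FN
```

The identities \({\mathrm {TP}} + {\mathrm {FN}} = P\) and \({\mathrm {FP}} + {\mathrm {TN}} = 1-P\) hold for the empirical estimates as well.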
We call a two-argument function \(\varPsi = \varPsi ({\mathrm {FP}},{\mathrm {FN}})\) a (generalized) classification performance metric. Given a classifier h, we define \(\varPsi (h) = \varPsi ({\mathrm {FP}}(h),{\mathrm {FN}}(h))\). Throughout the paper we assume that \(\varPsi ({\mathrm {FP}},{\mathrm {FN}})\) is linear-fractional, i.e., is a ratio of linear functions:
where we allow the coefficients \(a_i,b_i\) to depend on the distribution \(\Pr (x,y)\). Note that our convention to parameterize the metric by means of \(({\mathrm {FP}},{\mathrm {FN}})\) does not affect definition (1), because \(\varPsi \) can be reparameterized to \(({\mathrm {FP}},{\mathrm {TN}})\), \(({\mathrm {TP}},{\mathrm {FN}})\), or \(({\mathrm {TP}},{\mathrm {TN}})\), and will remain linear-fractional in all these parameterizations. We also assume \(\varPsi ({\mathrm {FP}},{\mathrm {FN}})\) is non-increasing in \({\mathrm {FP}}\) and \({\mathrm {FN}}\), a property that is inherently possessed by virtually all performance measures used in practice. Table 1 lists some popular examples of linear-fractional performance metrics.
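For instance, writing \(P = \Pr (y=1)\) and substituting \({\mathrm {TP}} = P - {\mathrm {FN}}\), some of the metrics of Table 1 take explicit linear-fractional forms in \(({\mathrm {FP}},{\mathrm {FN}})\); a sketch (function names are our own):

```python
def f_beta(FP, FN, P, beta=1.0):
    """F_beta = (1+b^2) TP / ((1+b^2) TP + b^2 FN + FP); substituting
    TP = P - FN gives a ratio of linear functions of (FP, FN)."""
    b2 = beta ** 2
    return (1.0 + b2) * (P - FN) / ((1.0 + b2) * P + FP - FN)

def jaccard(FP, FN, P):
    """Jaccard similarity: TP / (TP + FP + FN) = (P - FN) / (P + FP)."""
    return (P - FN) / (P + FP)

def accuracy(FP, FN):
    """Accuracy 1 - FP - FN is linear, hence trivially linear-fractional."""
    return 1.0 - FP - FN
```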
Let \(h^*_{\varPsi }\) be the maximizer of \(\varPsi (h)\) over all classifiers:
(if the \({\text {argmax}}\) is not unique, we take \(h^*_{\varPsi }\) to be any maximizer of \(\varPsi \)). Given any classifier h, we define its \(\varPsi \)-regret as:
The \(\varPsi \)-regret is non-negative by definition, and quantifies the suboptimality of h, i.e., how much worse h is compared to the optimal \(h^*_{\varPsi }\).
Strongly proper composite losses
Here we briefly outline the theory of strongly proper composite loss functions. See Agarwal (2014) for a more detailed description.
Define a binary class probability estimation (CPE) loss function (Reid and Williamson 2010, 2011) as a function \(c :\{-1,1\} \times [0,1] \rightarrow {\mathbb {R}}_+\), where \(c(y,\widehat{\eta })\) assigns a penalty to prediction \(\widehat{\eta }\) when the observed label is y. Define the conditional c-risk as:
the expected loss of prediction \(\widehat{\eta }\) when the label is drawn from a distribution with \(\Pr (y=1) = \eta \). We say a CPE loss is proper if for any \(\eta \in [0,1]\), \(\eta \in {\text {argmin}}_{\widehat{\eta } \in [0,1]} {\mathrm {risk}}_c(\eta ,\widehat{\eta })\). In other words, proper losses are minimized by taking the true class probability distribution as the prediction; hence \(\widehat{\eta }\) can be interpreted as a probability estimate of \(\eta \). Define the conditional c-regret as:
the difference between the conditional c-risk of \(\widehat{\eta }\) and the optimal c-risk. We say a CPE loss c is \(\lambda \)-strongly proper if for any \(\eta , \widehat{\eta }\):
i.e., the conditional c-regret is everywhere lower-bounded by a squared difference of its arguments. It can be shown (Agarwal 2014) that under mild regularity assumptions, a proper CPE loss c is \(\lambda \)-strongly proper if and only if the function \(H_{c}(\eta ) := {\mathrm {risk}}_{c}(\eta ,\eta )\) is \(\lambda \)-strongly concave. This fact lets us easily verify whether a given loss function is \(\lambda \)-strongly proper.
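As a quick numerical illustration of this criterion, consider the CPE form of the squared-error loss, \(c(y,\widehat{\eta }) = (y - (2\widehat{\eta }-1))^2\) on \(y \in \{-1,1\}\) (a standard formulation, corresponding to the link \(\psi (\widehat{\eta }) = 2\widehat{\eta }-1\)). Here \(H_c(\eta ) = 4\eta (1-\eta )\), whose second derivative is \(-8\) everywhere, so the loss is 8-strongly proper; a finite-difference check:

```python
import numpy as np

def H(eta):
    """H_c(eta) = risk_c(eta, eta) for the squared-error CPE loss
    c(y, p) = (y - (2p - 1))^2 with y in {-1, +1}."""
    s = 2.0 * eta - 1.0
    return eta * (1.0 - s) ** 2 + (1.0 - eta) * (-1.0 - s) ** 2

# lambda-strong concavity check: H''(eta) <= -lambda on (0, 1)
etas = np.linspace(0.01, 0.99, 99)
h = 1e-4
second = (H(etas + h) - 2.0 * H(etas) + H(etas - h)) / h ** 2
assert np.all(second <= -8.0 + 1e-2)   # lambda = 8 for squared error
```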
It is often more convenient to reparameterize the loss function from \(\widehat{\eta } \in [0,1]\) to a real-valued \(f \in {\mathbb {R}}\) through a strictly increasing (and therefore invertible) link function \(\psi :[0,1] \rightarrow {\mathbb {R}}\):
If c is \(\lambda \)-strongly proper, we call the function \(\ell :\{-1,1\} \times {\mathbb {R}} \rightarrow {\mathbb {R}}_+\) a \(\lambda \)-strongly proper composite loss function. The notions of conditional \(\ell \)-risk \({\mathrm {risk}}_\ell (\eta ,f)\) and conditional \(\ell \)-regret \({\mathrm {reg}}_{\ell }(\eta ,f)\) extend naturally to the case of composite losses:
and the strong properness of underlying CPE loss implies:
As an example, consider a logarithmic scoring rule:
where \(\llbracket Q \rrbracket \) is the indicator function, equal to 1 if Q holds, and to 0 otherwise. Its conditional risk is given by:
the cross-entropy between \(\eta \) and \(\widehat{\eta }\). The conditional c-regret is the binary Kullback–Leibler divergence between \(\eta \) and \(\widehat{\eta }\):
Note that since \(H(\eta ) = {\mathrm {risk}}_c(\eta ,\eta )\) is the binary entropy function, and \(\big | \frac{\text {d}^2 H}{\text {d}\eta ^2} \big | = \frac{1}{\eta (1-\eta )} \ge 4\), c is a 4-strongly proper loss. Using the logit link function \(\psi (\widehat{\eta }) = \log \frac{\widehat{\eta }}{1-\widehat{\eta }}\), we end up with the logistic loss function:
which is 4-strongly proper composite from the definition.
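The strong properness inequality for the logarithmic loss, \({\mathrm {reg}}_c(\eta ,\widehat{\eta }) \ge 2(\eta -\widehat{\eta })^2\), is thus a form of Pinsker's inequality for the binary KL divergence, and can be spot-checked numerically:

```python
import numpy as np

rng = np.random.default_rng(1)

def kl(eta, eta_hat):
    """Binary KL divergence: the conditional c-regret of the log loss."""
    return (eta * np.log(eta / eta_hat)
            + (1.0 - eta) * np.log((1.0 - eta) / (1.0 - eta_hat)))

# 4-strong properness: reg_c(eta, eta_hat) >= (4/2) * (eta - eta_hat)^2
eta, eta_hat = rng.uniform(0.01, 0.99, size=(2, 100000))
assert np.all(kl(eta, eta_hat) >= 2.0 * (eta - eta_hat) ** 2 - 1e-12)
```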
Table 2 presents some of the commonly used losses which are strongly proper composite. Note that the hinge loss \(\ell (y,f) = (1-yf)_+\), used, e.g., in support vector machines (Hastie et al. 2009), is not strongly proper composite (it is not even proper composite).
Main result
Given a real-valued function \(f :X \rightarrow {\mathbb {R}}\), and a \(\lambda \)-strongly proper composite loss \(\ell (y,f)\), define the \(\ell \)-risk of f as the expected loss of f(x) with respect to the data distribution:
where \(\eta (x) = \Pr (y=1|x)\). Let \(f_{\ell }^*\) be the minimizer of \({\mathrm {Risk}}_\ell (f)\) over all functions, \(f_{\ell }^* = {\text {argmin}}_f {\mathrm {Risk}}_\ell (f)\). Since \(\ell \) is proper composite:
Define the \(\ell \)-regret of f as:
Any real-valued function \(f :X \rightarrow {\mathbb {R}}\) can be turned into a classifier \(h_{f,\theta } :X \rightarrow \{-1,1\}\), by thresholding at some value \(\theta \):
The purpose of this paper is to address the following problem: given a function f with \(\ell \)-regret \({\mathrm {Reg}}_{\ell }(f)\), and a threshold \(\theta \), what can we say about the \(\varPsi \)-regret of \(h_{f,\theta }\)? For instance, can we bound \({\mathrm {Reg}}_{\varPsi }(h_{f,\theta })\) in terms of \({\mathrm {Reg}}_{\ell }(f)\)? We give a positive answer to this question, which is based on the following regret bound:
Lemma 1
Let \(\varPsi ({\mathrm {FP}},{\mathrm {FN}})\) be a linear-fractional function of the form (1), which is non-increasing in \({\mathrm {FP}}\) and \({\mathrm {FN}}\). Assume that there exists \(\gamma > 0\), such that for any classifier \(h :X \rightarrow \{-1,1\}\):
i.e., the denominator of \(\varPsi \) is positive and bounded away from zero. Let \(\ell \) be a \(\lambda \)-strongly proper composite loss function. Then, there exists a threshold \(\theta ^*\), such that for any real-valued function \(f :X \rightarrow {\mathbb {R}}\),
where \(C = \frac{1}{\gamma }\left( \varPsi (h_{\varPsi }^*)(b_1 + b_2) - (a_1 + a_2)\right) > 0\).
The proof is quite long and hence is postponed to Sect. 4. Interestingly, the proof goes by an intermediate bound of the \(\varPsi \)-regret by a cost-sensitive classification regret. We note that the bound in Lemma 1 is in general unimprovable, in the sense that it is easy to find f, \(\varPsi \), \(\ell \), and a distribution \(\Pr (x,y)\), for which the bound holds with equality (see the proof for details). We split the constant in front of the bound into C and \(\lambda \), because C depends only on \(\varPsi \), while \(\lambda \) depends only on \(\ell \). Table 3 lists these constants for some popular metrics. We note that the constant \(\gamma \) (the lower bound on the denominator of \(\varPsi \)) will in general be distribution-dependent (as it can depend on \(P=\Pr (y=1)\)) and may not have a uniform lower bound which holds for all distributions.
Lemma 1 has the following interpretation. If we are able to find a function f with small \(\ell \)-regret, we are guaranteed that there exists a threshold \(\theta ^*\) such that \(h_{f,\theta ^*}\) has small \(\varPsi \)-regret. Note that the same threshold \(\theta ^*\) will work for any f, and the right-hand side of the bound is independent of \(\theta ^*\). Hence, to minimize the right-hand side we only need to minimize the \(\ell \)-regret, and we can deal with the threshold afterwards.
Lemma 1 also reveals the form of the optimal classifier \(h_{\varPsi }^*\): take \(f=f^*_{\ell }\) in the lemma and note that \({\mathrm {Reg}}_{\ell }(f^*_{\ell })=0\), so that \({\mathrm {Reg}}_{\varPsi }(h_{f^*_{\ell },\theta ^*}) = 0\), which means that \(h_{f^*_{\ell },\theta ^*}\) is the maximizer of \(\varPsi \):
where the second equality is due to \(f^*_{\ell } = \psi (\eta )\) and strict monotonicity of \(\psi \). Hence, \(h^*_{\varPsi }\) is a threshold function on \(\eta \). The proof of Lemma 1 (see Sect. 4) actually specifies the exact value of the threshold \(\theta ^*\):
which is in agreement with the result obtained by Koyejo et al. (2014).
To make Lemma 1 easier to grasp, consider the special case when the performance metric \(\varPsi ({\mathrm {FP}},{\mathrm {FN}}) = 1 - {\mathrm {FP}} - {\mathrm {FN}}\) is the classification accuracy. In this case, (3) gives \(\psi ^{-1}(\theta ^*) = 1/2\). Hence, we obtain the well-known result that the classifier maximizing the accuracy is a threshold function on \(\eta \) at 1/2. Then, Lemma 1 states that given a real-valued f, we should take the classifier \(h_{f,\theta ^*}\) which thresholds f at \(\theta ^* = \psi (1/2)\). Using Table 2, one can easily verify that \(\theta ^* = 0\) for the logistic, squared-error and exponential losses. This agrees with the common approach of thresholding at 0 the real-valued classifiers trained by minimizing these losses, to obtain the label prediction. The bounds from the lemma are in this case identical (up to a multiplicative constant) to the bounds obtained by Bartlett et al. (2006).
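The link functions of these three losses, in their standard forms (as given, e.g., by Agarwal 2014), indeed all vanish at \(\widehat{\eta } = 1/2\):

```python
import numpy as np

# Standard link functions psi of three strongly proper composite losses;
# for each, theta* = psi(1/2) = 0.
links = {
    "logistic":      lambda p: np.log(p / (1.0 - p)),        # logit
    "squared-error": lambda p: 2.0 * p - 1.0,
    "exponential":   lambda p: 0.5 * np.log(p / (1.0 - p)),  # half-logit
}

for name, psi in links.items():
    assert abs(psi(0.5)) < 1e-12, name
```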
Unfortunately, for more complicated performance metrics, the optimal threshold \(\theta ^*\) is unknown, as (3) contains the unknown quantity \(\varPsi (h^*_{\varPsi })\), the value of the metric at the optimum. The solution in this case is, given f, to directly search for a threshold which maximizes \(\varPsi (h_{f,\theta })\). This is the main result of the paper:
Theorem 1
Given a realvalued function f, let \(\theta ^*_f = {\text {argmax}}_{\theta } \varPsi (h_{f,\theta })\). Then, under the assumptions and notation from Lemma 1:
Proof
The result follows immediately from Lemma 1: Solving \(\max _{\theta }\varPsi (h_{f,\theta })\) is equivalent to solving \(\min _\theta {\mathrm {Reg}}_{\varPsi }(h_{f,\theta })\), and \(\min _\theta {\mathrm {Reg}}_{\varPsi }(h_{f,\theta }) \le {\mathrm {Reg}}_{\varPsi }(h_{f,\theta ^*})\), where \(\theta ^*\) is the threshold given by Lemma 1. \(\square \)
Theorem 1 motivates the following procedure for maximization of \(\varPsi \):

1. Find f with small \(\ell \)-regret, e.g., by using a learning algorithm minimizing the \(\ell \)-risk on the training sample.

2. Given f, solve \(\theta ^*_f = {\text {argmax}}_{\theta } \varPsi (h_{f,\theta })\).
Theorem 1 states that the \(\varPsi \)-regret of the classifier obtained by this procedure is upper-bounded by the \(\ell \)-regret of the underlying real-valued function.
We now discuss how to approach step 2 of the procedure in practice. In principle, this step requires maximizing \(\varPsi \) defined through \({\mathrm {FP}}\) and \({\mathrm {FN}}\), which are expectations over an unknown distribution \(\Pr (x,y)\). However, it is sufficient to optimize \(\theta \) on the empirical counterpart of \(\varPsi \) calculated on a separate validation sample. Let \(\mathcal {T} = \{(x_i,y_i)\}_{i=1}^n\) be the validation set of size n. Define:
the empirical counterparts of \({\mathrm {FP}}\) and \({\mathrm {FN}}\), and let \(\widehat{\varPsi }(h) = \varPsi (\widehat{{\mathrm {FP}}}(h),\widehat{{\mathrm {FN}}}(h))\) be the empirical counterpart of the performance metric \(\varPsi \). We now replace step 2 by:
Given f and validation sample \(\mathcal {T}\), solve \(\widehat{\theta }_f = {\text {argmax}}_{\theta } \widehat{\varPsi }(h_{f,\theta })\).
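In practice, this is a cheap one-dimensional search: on a validation sample of size n, the classifier \(h_{f,\theta }\) changes only when \(\theta \) crosses one of the n scores \(f(x_i)\), so at most \(n+1\) candidate thresholds need to be evaluated. A sketch (function name and signature are our own; `Psi_hat` is any empirical metric written as a function of the empirical \({\mathrm {FP}}\), \({\mathrm {FN}}\), and P):

```python
import numpy as np

def tune_threshold(scores, y, Psi_hat):
    """Maximize the empirical metric over thresholds; labels in {-1, +1}.
    Since predictions only change when theta crosses a score, checking
    the n sorted scores plus one value below them all is sufficient."""
    scores, y = np.asarray(scores, dtype=float), np.asarray(y)
    P = np.mean(y == 1)
    candidates = np.concatenate(([scores.min() - 1.0], np.sort(scores)))
    best_theta, best_val = None, -np.inf
    for theta in candidates:
        pred = np.where(scores > theta, 1, -1)
        FP = np.mean((pred == 1) & (y == -1))
        FN = np.mean((pred == -1) & (y == 1))
        val = Psi_hat(FP, FN, P)
        if val > best_val:
            best_theta, best_val = theta, val
    return best_theta, best_val
```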
In Theorem 2 below, we show that:
so that tuning the threshold on the validation sample of size n (which results in \(\widehat{\theta }_f\)) instead of on the population level (which results in \(\theta ^*_f\)) will cost at most \(O\Big (\sqrt{\frac{\log n}{n}}\Big )\) additional regret. The main idea of the proof is that finding the optimal threshold comes down to optimizing within a class of \(\{-1,1\}\)-valued threshold functions, which has small Vapnik–Chervonenkis dimension. This, together with the fact that under assumptions from Lemma 1, \(\varPsi \) is stable with respect to its arguments, implies that \(\varPsi (h_{f,\widehat{\theta }_f})\) is close to \(\varPsi (h_{f,\theta ^*_f})\).
Theorem 2
Let the assumptions from Lemma 1 hold, and let:
and \(D = \max \{D_1,D_2\}\). Given a real-valued function f, and a validation set \(\mathcal {T}\) of size n generated i.i.d. from \(\Pr (x,y)\), let \(\widehat{\theta }_f = {\text {argmax}}_{\theta } \widehat{\varPsi } (h_{f,\theta })\) be the threshold maximizing the empirical counterpart of \(\varPsi \) evaluated on \(\mathcal {T}\). Then, with probability \(1-\delta \) (over the random choice of \(\mathcal {T}\)):
Proof
For any \({\mathrm {FP}}\) and \({\mathrm {FN}}\), we have:
and similarly,
For any \(({\mathrm {FP}},{\mathrm {FN}})\) and \(({\mathrm {FP}}',{\mathrm {FN}}')\), Taylorexpanding \(\varPsi ({\mathrm {FP}},{\mathrm {FN}})\) around \(({\mathrm {FP}}',{\mathrm {FN}}')\) up to the first order and using the bounds above gives:
Now, we have:
where we used Theorem 1. Thus, it remains to bound \(\varPsi (h_{f,\theta ^*_f}) - \varPsi (h_{f,\widehat{\theta }_f})\). From the definition of \(\widehat{\theta }_f\), \(\widehat{\varPsi }(h_{f,\widehat{\theta }_f}) \ge \widehat{\varPsi }(h_{f,\theta ^*_f})\), hence:
where we used the definition of \(\widehat{\varPsi }\). Using (4),
Note that the suprema above are over the deviation of an empirical mean from its expectation over the class of threshold functions, which has Vapnik–Chervonenkis dimension equal to 2. Using a standard argument from Vapnik–Chervonenkis theory (see, e.g., Devroye et al. 1996), with probability \(1 - \frac{\delta }{2}\) over the random choice of \(\mathcal {T}\):
and similarly for the second supremum. Thus, with probability \(1-\delta \),
which finishes the proof. \(\square \)
We note that, in contrast to a similar result by Koyejo et al. (2014), Theorem 2 does not require continuity of the cumulative distribution function of \(\eta (x)\) around \(\theta ^*\).
Proof of Lemma 1
The proof can be skipped without affecting the flow of later sections. The proof consists of two steps. First, we bound the \(\varPsi \)-regret of any classifier h by its cost-sensitive classification regret (introduced below). Next, we show that there exists a threshold \(\theta ^*\), such that for any f, the cost-sensitive classification regret of \(h_{f,\theta ^*}\) is upper-bounded by the \(\ell \)-regret of f. These two steps will be formalized as Propositions 1 and 2.
Given a real number \(\alpha \in [0,1]\), define a cost-sensitive classification loss \(\ell _\alpha :\{-1,1\} \times \{-1,1\} \rightarrow {\mathbb {R}}_+\) as:
The cost-sensitive loss assigns different costs of misclassification for positive and negative labels. Given a classifier h, the cost-sensitive risk of h is:
and the cost-sensitive regret is:
where \(h^*_{\alpha } = {\text {argmin}}_h {\mathrm {Risk}}_{\alpha }(h)\). We now show the following two results:
Proposition 1
Let \(\varPsi \) satisfy the assumptions from Lemma 1. Define:
Then, \(\alpha \in [0,1]\) and for any classifier h,
where C is defined as in the statement of Lemma 1.
Proof
The proof generalizes the proof of Proposition 6 from Parambath et al. (2014), which concerned the special case of the \(F_{\beta }\)-measure. For the sake of clarity, we use the shorthand notation \(\varPsi = \varPsi (h)\), \(\varPsi ^* = \varPsi (h^*_{\varPsi })\), \({\mathrm {FP}}= {\mathrm {FP}}(h)\), \({\mathrm {FN}}= {\mathrm {FN}}(h)\), \(A = a_0 + a_1 {\mathrm {FP}}+ a_2 {\mathrm {FN}}\), \(B = b_0 + b_1 {\mathrm {FP}}+ b_2 {\mathrm {FN}}\) for the numerator and denominator of \(\varPsi (h)\), and analogously \({\mathrm {FP}}^*\), \({\mathrm {FN}}^*\), \(A^*\) and \(B^*\) for \(\varPsi (h^*_{\varPsi })\). In this notation:
where the last inequality follows from \(B \ge \gamma \) (assumption) and the fact that \({\mathrm {Reg}}_{\varPsi }(h) \ge 0\) for any h. Since \(\varPsi \) is non-increasing in \({\mathrm {FP}}\) and \({\mathrm {FN}}\), we have
and similarly \(\frac{\partial \varPsi ^*}{\partial {\mathrm {FN}}^*} = \frac{a_2 - b_2 \varPsi ^*}{B^*} \le 0\). This and the assumption \(B^* \ge \gamma \) imply that both \(\varPsi ^* b_1 - a_1\) and \(\varPsi ^* b_2 - a_2\) are non-negative, so they can be interpreted as misclassification costs. If we normalize the costs by defining:
then (6) implies:
\(\square \)
Proposition 2
For any real-valued function \(f :X \rightarrow {\mathbb {R}}\), any \(\lambda \)-strongly proper composite loss \(\ell \) with link function \(\psi \), and any \(\alpha \in [0,1]\):
where \(\theta ^* = \psi (\alpha )\).
Proof
First, we will show that (7) holds conditionally for every x. To this end, we fix x and deal with \(h(x) \in \{-1,1\}\), \(f(x) \in {\mathbb {R}}\) and \(\eta (x) \in [0,1]\), using the shorthand notation \(h,f,\eta \).
Given \(\eta \in [0,1]\) and \(h \in \{-1,1\}\), define the conditional cost-sensitive risk as:
Let \(h_{\alpha }^* = {\text {argmin}}_{h} {\mathrm {risk}}_{\alpha }(\eta ,h)\). It can be easily verified that:
Define the conditional cost-sensitive regret as
Note that if \(h = h_{\alpha }^*\), then \({\mathrm {reg}}_\alpha (\eta ,h) = 0\). Otherwise, \({\mathrm {reg}}_\alpha (\eta ,h) = |\eta - \alpha |\), so that:
Now assume \(h = {\mathrm {sgn}}(\widehat{\eta } - \alpha )\) for some \(\widehat{\eta }\), i.e., h is of the same form as \(h_{\alpha }^*\) in (8), with \(\eta \) replaced by \(\widehat{\eta }\). We show that for such h,
This statement trivially holds when \(h = h_{\alpha }^*\). If \(h \ne h_{\alpha }^*\), then \(\eta \) and \(\widehat{\eta }\) are on opposite sides of \(\alpha \) (i.e., either \(\eta \ge \alpha \) and \(\widehat{\eta } < \alpha \), or \(\eta < \alpha \) and \(\widehat{\eta } \ge \alpha \)), hence \(|\eta - \alpha | \le |\eta - \widehat{\eta }|\), which proves (9).
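These conditional quantities are simple enough to tabulate directly. The sketch below uses one standard cost normalization (a false positive costs \(\alpha \), a false negative costs \(1-\alpha \), so the costs sum to one), under which the optimal classifier thresholds \(\eta \) at \(\alpha \) as in (8); the displayed definition of \(\ell _\alpha \) in the text is the authoritative form:

```python
def ell_alpha(y, y_hat, alpha):
    """Cost-sensitive loss, costs normalized to sum to one: a false
    positive costs alpha, a false negative costs 1 - alpha."""
    if y == 1 and y_hat == -1:
        return 1.0 - alpha
    if y == -1 and y_hat == 1:
        return alpha
    return 0.0

def cond_reg_alpha(eta, h, alpha):
    """Conditional cost-sensitive regret: zero when h agrees with
    sgn(eta - alpha), and |eta - alpha| otherwise."""
    def risk(hh):
        return eta * ell_alpha(1, hh, alpha) + (1.0 - eta) * ell_alpha(-1, hh, alpha)
    return risk(h) - min(risk(1), risk(-1))
```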
Now, we set the threshold to \(\theta ^* = \psi (\alpha )\), so that given \(f \in {\mathbb {R}}\),
due to the strict monotonicity of \(\psi \). Using (9) with \(h = h_{f,\theta ^*}\) and \(\widehat{\eta } = \psi ^{-1}(f)\) gives:
and the last inequality follows from strong properness (2).
To prove the unconditional statement (7), we take expectation with respect to x on both sides of (10):
where the second inequality is from Jensen’s inequality applied to the concave function \(x \mapsto \sqrt{x}\).
We note that derivation of (9) follows the steps of the proof of Lemma 4 in Menon et al. (2013), while (10) and (11) were shown in the proof of Theorem 13 by Agarwal (2014). Hence, the proof is essentially a combination of existing results, which are rederived here for the sake of completeness. \(\square \)
Proof
(of Lemma 1). Lemma 1 immediately follows from Propositions 1 and 2. \(\square \)
Note that the proof actually specifies the exact value of the universal threshold, \(\theta ^* = \psi (\alpha )\), where \(\alpha \) is given by (5).
The bound in Lemma 1 is unimprovable in the sense that there exist f, \(\varPsi \), \(\ell \), and a distribution \(\Pr (x,y)\) for which the bound is tight. To see this, take, for instance, the squared-error loss \(\ell (y,f) = (y-f)^2\) and the classification accuracy metric \(\varPsi ({\mathrm {FP}},{\mathrm {FN}}) = 1-{\mathrm {FP}}-{\mathrm {FN}}\). The constants in Lemma 1 are equal to \(\gamma = 1\), \(C = 2\), and \(\lambda = 8\) (see Table 1), while the optimal threshold is \(\theta ^* = 0\). The bound then simplifies to
which is known to be tight (Bartlett et al. 2006).
Multi-label classification
In multi-label classification (Dembczyński et al. 2012; Parambath et al. 2014; Koyejo et al. 2015), the goal is, given an input (feature vector) \(x \in X\), to simultaneously predict the subset \(L \subseteq \mathcal {L}\) of the set of m labels \(\mathcal {L} = \{\sigma _1,\ldots ,\sigma _m\}\). The subset L is often called the set of relevant (positive) labels, while the complement \(\mathcal {L} {\setminus } L\) is considered as irrelevant (negative) for x. We identify a set L of relevant labels with a vector \({\varvec{y}}= (y_1, y_2, \ldots , y_m)\), \(y_i \in \{-1,1\}\), in which \(y_i = 1\) iff \(\sigma _i \in L\). We assume observations \((x,{\varvec{y}})\) are generated i.i.d. according to \(\Pr (x,{\varvec{y}})\) (note that the labels are not assumed to be independent). A multi-label classifier:
is a mapping \({\varvec{h}}:X \rightarrow \{-1,1\}^m\), which assigns a (predicted) label subset to each instance \(x \in X\). For any \(i=1,\ldots ,m\), the function \(h_i(x)\) is thus a binary classifier, which can be evaluated by means of \({\mathrm {TP}}_i(h_i)\), \({\mathrm {FP}}_i(h_i)\), \({\mathrm {TN}}_i(h_i)\) and \({\mathrm {FN}}_i(h_i)\), the true/false positives/negatives defined with respect to label \(y_i\), e.g., \({\mathrm {FP}}_i(h_i) = \Pr (h_i(x) = 1 \wedge y_i = -1)\).
Let \(f_1,\ldots ,f_m\) be a set of real-valued functions \(f_i :X \rightarrow {\mathbb {R}}\), \(i=1,\ldots ,m\), and let \(\ell \) be a \(\lambda \)-strongly proper composite loss for binary classification. For each \(i=1,\ldots ,m\), we let \({\mathrm {Risk}}^i_{\ell }(f_i)\) and \({\mathrm {Reg}}^i_{\ell }(f_i)\) denote the \(\ell \)-risk and the \(\ell \)-regret of function \(f_i\) with respect to label \(y_i\):
Note that the problem has been decomposed into m independent binary problems and the functions can be obtained by training m independent realvalued binary classifiers by minimizing loss \(\ell \) on the training sample, one for each out of m labels.
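A minimal sketch of this per-label decomposition, assuming a toy one-dimensional data set and a simple gradient-descent minimizer of logistic loss (the data-generating process and all helper names below are ours, for illustration only):

```python
import math
import random

def train_logistic(xs, ys, steps=500, lr=0.1):
    # Minimize empirical logistic loss sum_i log(1 + exp(-y_i * (w*x_i + b)))
    # by plain gradient descent (1-d features for simplicity).
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(y * (w * x + b)))  # = sigmoid(-y*f(x))
            gw -= y * x * p / n
            gb -= y * p / n
        w -= lr * gw
        b -= lr * gb
    return lambda x: w * x + b  # real-valued scoring function f_i

# Toy multilabel sample: each label y_i follows its own hypothetical logistic
# model, and each f_i is trained on its own label, independently of the rest.
random.seed(0)
xs = [random.uniform(-2, 2) for _ in range(200)]
Y = [[1 if random.random() < 1 / (1 + math.exp(-(i + 1) * x)) else -1
      for x in xs] for i in range(3)]
fs = [train_logistic(xs, Y[i]) for i in range(3)]  # m = 3 independent problems
```

Each `fs[i]` is trained only on the i-th column of labels, mirroring the decomposition into m independent binary problems.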
What follows next depends on how the binary classification performance metric is applied in the multilabel setting. We consider two ways of turning a binary classification metric into a multilabel metric: macro-averaging and micro-averaging (Manning et al. 2008; Parambath et al. 2014; Koyejo et al. 2015).
Macro-averaging
Given a binary classification performance metric \(\varPsi (h) = \varPsi ({\mathrm {FP}}(h),{\mathrm {FN}}(h))\), and a multilabel classifier \({\varvec{h}}\), we define the macro-averaged metric \(\varPsi _{\mathrm {macro}}({\varvec{h}})\) as:
Macro-averaging is thus based on first computing the performance metric separately for each label, and then averaging the metrics over the labels. The \(\varPsi _{\mathrm {macro}}\)-regret is then defined as:
where \({\varvec{h}}_{\varPsi }^* = (h^*_{\varPsi ,1},\ldots ,h^*_{\varPsi ,m})\) is the \(\varPsi \)-optimal multilabel classifier:
Since the regret decomposes into a weighted sum, it is straightforward to apply the previously derived bound to obtain a regret bound for the macro-averaged performance metric.
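To make the macro-averaging definition concrete, here is a small sketch with the F-measure playing the role of \(\varPsi \) (the helper functions and toy confusion counts are illustrative, not from the paper):

```python
def f1(fp, fn, tp):
    # F-measure as one concrete choice of Psi: 2*TP / (2*TP + FP + FN).
    d = 2 * tp + fp + fn
    return 2 * tp / d if d else 0.0

def macro_average(per_label_counts, metric=f1):
    # Macro-averaging: compute Psi separately per label, then average.
    vals = [metric(fp, fn, tp) for (fp, fn, tp) in per_label_counts]
    return sum(vals) / len(vals)

counts = [(10, 5, 85), (2, 3, 15), (0, 1, 4)]  # toy (FP, FN, TP) per label
macro_f = macro_average(counts)
```

Note that labels with few positive examples contribute to the average with the same weight as frequent labels, which is characteristic of macro-averaging.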
Theorem 3
Let \(\varPsi ({\mathrm {FP}},{\mathrm {FN}})\) and \(\ell \) satisfy the assumptions of Lemma 1. For a set of m real-valued functions \(\{ f_i :X \rightarrow {\mathbb {R}} \}_{i=1}^m\), let \(\theta ^*_{f_i} = {\text {argmax}}_{\theta } \varPsi (h_{f_i,\theta })\) for each \(i=1,\ldots ,m\). Then the classifier \({\varvec{h}}\) defined as:
achieves the following bound on its \(\varPsi _{\mathrm {macro}}\)-regret:
where \(C_i = \frac{1}{\gamma }\left( \varPsi (h_{\varPsi ,i}^*)(b_1 + b_2) - (a_1 + a_2)\right) \), \(i=1,\ldots ,m\).
Proof
The theorem follows from applying Theorem 1 once for each label, and then averaging the bounds over the labels. \(\square \)
Theorem 3 suggests a straightforward decomposition into m independent binary classification problems, one for each label \(y_1,\ldots ,y_m\), and running (independently for each problem) the two-step procedure described in Sect. 3: for \(i=1,\ldots ,m\), we learn a function \(f_i\) with small \(\ell \)-regret with respect to label \(y_i\), and tune the threshold \(\theta ^*_{f_i}\) to optimize \(\varPsi (h_{f_i,\theta })\) (as in the binary classification case, one can show that tuning the threshold on a separate validation sample is sufficient). Due to the decomposition of \(\varPsi _{\mathrm {macro}}\) into a sum over the labels, this simple procedure turns out to be sufficient. As we shall see, the case of micro-averaging is more interesting.
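The per-label threshold tuning step of this procedure can be sketched as follows (a toy illustration; the candidate-midpoint scheme and the numbers are our own choices, not prescribed by the theory):

```python
def f1(fp, fn, tp):
    # F-measure as a function of confusion counts: 2*TP / (2*TP + FP + FN).
    d = 2 * tp + fp + fn
    return 2 * tp / d if d else 0.0

def tune_threshold(scores, labels, metric=f1):
    # Candidate thresholds: midpoints between consecutive sorted scores,
    # plus one threshold below all scores (predict everything positive).
    uniq = sorted(set(scores))
    cands = [uniq[0] - 1.0] + [(a + b) / 2 for a, b in zip(uniq, uniq[1:])]

    def value(th):
        tp = sum(1 for s, y in zip(scores, labels) if s > th and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s > th and y == -1)
        fn = sum(1 for s, y in zip(scores, labels) if s <= th and y == 1)
        return metric(fp, fn, tp)

    return max(cands, key=value)

# Macro procedure: one threshold per label, each tuned on its own
# validation scores (toy numbers below, m = 2 labels).
val_scores = [[0.9, 0.2, 0.6, 0.1], [0.8, 0.7, 0.3, 0.4]]
val_labels = [[1, -1, 1, -1], [1, -1, -1, 1]]
thresholds = [tune_threshold(s, y) for s, y in zip(val_scores, val_labels)]
```

Scanning only midpoints between sorted validation scores is enough, since the empirical metric is piecewise constant in the threshold.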
Micro-averaging
Given a binary classification performance metric \(\varPsi (h) = \varPsi ({\mathrm {FP}}(h),{\mathrm {FN}}(h))\), and a multilabel classifier \({\varvec{h}}\), we define the micro-averaged metric \(\varPsi _{\mathrm {micro}}({\varvec{h}})\) as:
where
Thus, in micro-averaging, the false positives and false negatives are first averaged over the labels, and then the performance metric is computed on these averaged quantities. The \(\varPsi _{\mathrm {micro}}\)-regret:
does not decompose into a sum over labels anymore. However, we are still able to obtain a regret bound by reusing the techniques from Sect. 4, and, interestingly, this time only a single threshold, shared among all labels, needs to be tuned.^{Footnote 3}
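For concreteness, micro-averaging can be sketched as follows, again with the F-measure standing in for \(\varPsi \) (toy counts; helper names are ours):

```python
def f1(fp, fn, tp):
    # F-measure: 2*TP / (2*TP + FP + FN).
    d = 2 * tp + fp + fn
    return 2 * tp / d if d else 0.0

def micro_average(per_label_counts, metric=f1):
    # Micro-averaging: average the FP/FN (and TP) quantities over labels
    # first, then apply the metric once to the averaged quantities.
    m = len(per_label_counts)
    fp = sum(c[0] for c in per_label_counts) / m
    fn = sum(c[1] for c in per_label_counts) / m
    tp = sum(c[2] for c in per_label_counts) / m
    return metric(fp, fn, tp)

counts = [(10, 5, 85), (2, 3, 15), (0, 1, 4)]  # toy (FP, FN, TP) per label
micro_f = micro_average(counts)
```

On the same counts, micro- and macro-averaging generally disagree: micro-averaging weights errors by their pooled frequency, so frequent labels dominate.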
Theorem 4
Let \(\varPsi ({\mathrm {FP}},{\mathrm {FN}})\) and \(\ell \) satisfy the assumptions of Lemma 1. For a set of m real-valued functions \(\{ f_i :X \rightarrow {\mathbb {R}} \}_{i=1}^m\), let \(\theta ^*_f = {\text {argmax}}_{\theta } \varPsi _{\mathrm {micro}}({\varvec{h}}_{f,\theta })\), where
Then, the classifier \({\varvec{h}}_{f,\theta ^*_f} = (h_{f_1,\theta ^*_f}, \ldots , h_{f_m,\theta ^*_f})\) achieves the following bound on its \(\varPsi _{\mathrm {micro}}\)-regret:
where \(C = \frac{1}{\gamma }\left( \varPsi _{\mathrm {micro}}({\varvec{h}}_{\varPsi }^*)(b_1 + b_2) - (a_1 + a_2)\right) \).
Proof
The proof follows closely the proof of Lemma 1. In fact, only Proposition 1 requires modifications, which are given below. Take any real values \({\mathrm {FP}},{\mathrm {FN}}\) and \({\mathrm {FP}}^*,{\mathrm {FN}}^*\) (to be specified later) in the domain of \(\varPsi \), such that:
Using exactly the same steps as in the derivation of (6), we obtain:
where
Now, we take: \({\mathrm {FP}}^* = \overline{{\mathrm {FP}}}({\varvec{h}}_{\varPsi }^*), {\mathrm {FN}}^* = \overline{{\mathrm {FN}}}({\varvec{h}}_{\varPsi }^*)\), \({\mathrm {FP}}= \overline{{\mathrm {FP}}}({\varvec{h}})\) and \({\mathrm {FN}}= \overline{{\mathrm {FN}}}({\varvec{h}})\) for some \({\varvec{h}}\). Hence, (12) is clearly satisfied as its left-hand side is just the \(\varPsi _{\mathrm {micro}}\)-regret, \({\mathrm {Reg}}_{\varPsi _{\mathrm {micro}}}({\varvec{h}})\). This means that for any multilabel classifier \({\varvec{h}}\):
where \({\mathrm {Risk}}^i_{\alpha }(h_i)\) and \({\mathrm {Reg}}^i_{\alpha }(h_i)\) are the cost-sensitive risk and the cost-sensitive regret defined with respect to label \(y_i\):
If we now take \(h_i = h_{f_i,\theta ^*}\), where \(\theta ^* = \psi (\alpha )\), \(\psi \) being the link function of the loss, Proposition 2 (applied for each \(i=1,\ldots ,m\) separately) implies:
Together, this gives:
The theorem now follows by noticing that:
and thus \({\mathrm {Reg}}_{\varPsi _{\mathrm {micro}}}({\varvec{h}}_{f,\theta ^{*}_{f}}) \le {\mathrm {Reg}}_{\varPsi _{\mathrm {micro}}}({\varvec{h}}_{f,\theta ^{*}})\). \(\square \)
Theorem 4 suggests a decomposition into m independent binary classification problems, one for each label \(y_1,\ldots ,y_m\), and training m real-valued functions \(f_1,\ldots ,f_m\), each with small \(\ell \)-regret on the corresponding label. Then, however, contrary to macro-averaging, a single threshold, shared among all labels, is tuned by optimizing \(\varPsi _{\mathrm {micro}}\) on a separate validation sample.
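The shared-threshold tuning suggested by Theorem 4 can be sketched as follows (toy scores and labels; the pooled-midpoint candidate set is our own illustrative choice):

```python
def tune_shared_threshold(scores_per_label, labels_per_label):
    # Pool scores from all labels; candidates are midpoints between
    # consecutive pooled scores plus one value below all of them.
    pooled = sorted({s for scores in scores_per_label for s in scores})
    cands = [pooled[0] - 1.0] + [(a + b) / 2 for a, b in zip(pooled, pooled[1:])]

    def micro_f1(th):
        # Micro-F1 of thresholding every label's scores at the same th.
        tp = fp = fn = 0
        for scores, labels in zip(scores_per_label, labels_per_label):
            for s, y in zip(scores, labels):
                if s > th:
                    tp += (y == 1)
                    fp += (y == -1)
                else:
                    fn += (y == 1)
        d = 2 * tp + fp + fn
        return 2 * tp / d if d else 0.0

    return max(cands, key=micro_f1)

# Toy validation scores for m = 2 labels (illustrative numbers).
scores = [[0.9, 0.2, 0.6, 0.1], [0.8, 0.7, 0.3, 0.4]]
labels = [[1, -1, 1, -1], [1, -1, -1, 1]]
theta = tune_shared_threshold(scores, labels)
```

In contrast to the macro-averaged case, the maximization runs over one scalar threshold applied to all labels jointly.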
Empirical results
We perform experiments on synthetic and benchmark data to empirically study the two-step procedure analyzed in the previous sections. To this end, we minimize a surrogate loss in the first step to obtain a real-valued function f, and in the second step, we tune a threshold \(\hat{\theta }\) on a separate validation set to optimize a given performance metric. We use logistic loss as the surrogate loss in this procedure. Recall that logistic loss is 4-strongly proper composite (see Table 2). We compare its performance with hinge loss, which is not even a proper composite loss.
As our task performance metrics, we take the F-measure (\(F_{\beta }\)-measure with \(\beta =1\)) and the AM measure (a special case of weighted accuracy with weights \(w_1 = P\) and \(w_2 = 1-P\)). We could also use the Jaccard similarity coefficient; it turns out, however, that the threshold optimized for the F-measure coincides with the optimal threshold for the Jaccard similarity coefficient (because the Jaccard similarity coefficient is strictly monotonic in the F-measure and vice versa), so the latter measure does not give anything substantially different from the F-measure.
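Both task metrics can be written as functions of the unconditional error rates \({\mathrm {FP}}\), \({\mathrm {FN}}\) and the class prior \(P = \Pr (y=1)\), using \({\mathrm {TP}} = P - {\mathrm {FN}}\). A small sketch (function names are ours):

```python
def am_measure(fp, fn, pos_prior):
    # AM: arithmetic mean of true positive rate and true negative rate,
    # with fp = Pr(h=1, y=-1), fn = Pr(h=-1, y=1), pos_prior = Pr(y=1).
    tpr = (pos_prior - fn) / pos_prior
    tnr = (1.0 - pos_prior - fp) / (1.0 - pos_prior)
    return 0.5 * (tpr + tnr)

def f_beta(fp, fn, pos_prior, beta=1.0):
    # F_beta via unconditional rates, using TP = pos_prior - FN.
    tp = pos_prior - fn
    b2 = beta * beta
    d = (1 + b2) * tp + fp + b2 * fn
    return (1 + b2) * tp / d if d else 0.0
```

For instance, a classifier that always predicts the negative class has \({\mathrm {FP}}=0\) and \({\mathrm {FN}}=P\), giving AM exactly 1/2 and F-measure 0.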
The experiments on benchmark data are split into two parts. The first part concerns binary classification problems, while the second part concerns multilabel classification.
The purpose of this study is not to compare the two-step approach with alternative methods; this has already been done in previous work on the subject, see, e.g., Nan et al. (2012) and Parambath et al. (2014). We also note that similar experiments have been performed in the cited papers on the statistical consistency of generalized performance metrics (Koyejo et al. 2014; Narasimhan et al. 2014; Parambath et al. 2014; Koyejo et al. 2015). We therefore unavoidably repeat some of the results obtained therein, but the main novelty of the experiments reported here is that we emphasize the difference between strongly proper composite losses and non-proper losses.
Synthetic data
We performed two experiments on synthetic data. The first experiment deals with a discrete domain, in which we learn within the class of all possible classifiers. The second experiment concerns a continuous domain, in which we learn within a restricted class of linear functions.
First experiment We let the input domain X be a finite set consisting of 25 elements, \(X=\{1,2,\ldots ,25\}\), and take \(\Pr (x)\) to be uniform over X, i.e., \(\Pr (x=i)=1/25\). For each \(x \in X\), we randomly draw a value of \(\eta (x)\) from the uniform distribution on the interval [0, 1]. In the first step, we take an algorithm which minimizes a given surrogate loss \(\ell \) within the class of all functions \(f :X \rightarrow {\mathbb {R}}\). Hence, given the training data of size n, the algorithm computes the empirical minimizer of the surrogate loss \(\ell \) independently for each x. As surrogate losses, we use logistic and hinge loss. In the second step, we tune the threshold \(\hat{\theta }\) on a separate validation set, also of size n. For each n, we repeat the procedure 100,000 times, averaging over samples and over models (different random choices of \(\eta (x)\)). We start with \(n=100\) and increase the number of training examples up to \(n=10{,}000\). The \(\ell \)-regret and \(\varPsi \)-regret can be computed exactly, as the distribution is known and X is discrete.
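A compact simulation of this first experiment's estimation step, using the standard fact that, per x, the empirical logistic-loss minimizer over all functions is the logit of the empirical positive frequency at x (the smoothing and clipping below are our own additions to keep the logit finite):

```python
import math
import random

random.seed(1)
X = list(range(25))
eta = {x: random.random() for x in X}  # random model: eta(x) ~ U[0, 1]

def sample(n):
    # Draw n i.i.d. pairs (x, y) with x uniform on X and y ~ Bernoulli(eta(x)).
    xs = [random.choice(X) for _ in range(n)]
    ys = [1 if random.random() < eta[x] else -1 for x in xs]
    return xs, ys

def fit_per_x(xs, ys, eps=1e-3):
    # Per-x empirical minimizer of logistic loss: the logit of the
    # (smoothed, clipped) empirical positive frequency at x.
    f = {}
    for x in X:
        labs = [y for xi, y in zip(xs, ys) if xi == x]
        pos = sum(1 for y in labs if y == 1)
        p = (pos + 0.5) / (len(labs) + 1.0)  # smoothing: our own addition
        p = min(max(p, eps), 1.0 - eps)
        f[x] = math.log(p / (1.0 - p))
    return f

f_hat = fit_per_x(*sample(10000))
```

Since \(\eta \) and \(\Pr (x)\) are known here, the exact \(\ell \)-regret and \(\varPsi \)-regret of `f_hat` could then be computed by summing over the 25 points.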
The results are given in Fig. 1. The \(\ell \)-regret goes down to zero for both surrogate losses, which is expected, since this is the objective function minimized by the algorithm. Minimization of logistic loss (left plot) gives vanishing \(\varPsi \)-regret for both the F-measure and the AM measure, as predicted by Theorem 1. In contrast, minimization of hinge loss (right plot) is suboptimal for both task metrics and gives nonzero \(\varPsi \)-regret even in the limit \(n \rightarrow \infty \). This behavior is easily explained by the fact that hinge loss is not a proper (composite) loss: the risk minimizer for hinge loss is \(f^*_{\ell }(x) = {\mathrm {sgn}}(\eta (x)-1/2)\) (Bartlett et al. 2006). Hence, the hinge loss minimizer is already a threshold function on \(\eta (x)\), with the threshold value set to 1/2. If, for a given performance metric \(\varPsi \), the optimal threshold \(\theta ^*\) is different from 1/2, the hinge loss minimizer will necessarily have suboptimal \(\varPsi \)-risk. This is clearly visible for the F-measure. The better result on the AM measure is explained by the fact that the average optimal threshold over all models is 0.5 for this measure, so the minimizer of hinge loss is not that far from the minimizer of the AM measure.
Second experiment We take \(X = {\mathbb {R}}^2\) and generate \(x \in X\) from a standard Gaussian distribution. We use a logistic model of the form \(\eta (x) = \frac{1}{1 + \exp (-a_0 - a^\top x)}\). The weights \(a = (a_1, a_2)\) and \(a_0\) are also drawn from a standard Gaussian. For a given model (set of weights), we take training sets of increasing size, from \(n=100\) up to \(n=3000\), using 20 different sets for each n. We also generate one test set of size 100,000. For each n, we use 2/3 of the training data to learn a linear model \(f(x) = w_0 + w^\top x\), using either support vector machines (SVM, with linear kernel) or logistic regression (LR). We use the implementations of these algorithms from the LibLinear package (Fan et al. 2008).^{Footnote 4} The remaining 1/3 of the training data is used for tuning the threshold. We average the results over 20 different models.
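The data-generating process of this experiment can be sketched as follows (a pure-Python stand-in; in the actual experiment the subsequent fitting is done with LibLinear, for which scikit-learn's `LogisticRegression` with `solver='liblinear'` would be one possible interface):

```python
import math
import random

random.seed(2)
a0 = random.gauss(0, 1)
a = (random.gauss(0, 1), random.gauss(0, 1))

def eta(x):
    # Logistic model: eta(x) = 1 / (1 + exp(-a0 - a^T x)).
    return 1.0 / (1.0 + math.exp(-a0 - a[0] * x[0] - a[1] * x[1]))

def draw(n):
    # Draw n i.i.d. examples: x ~ standard Gaussian, y ~ Bernoulli(eta(x)).
    data = []
    for _ in range(n):
        x = (random.gauss(0, 1), random.gauss(0, 1))
        y = 1 if random.random() < eta(x) else -1
        data.append((x, y))
    return data

train = draw(3000)
fit_part, val_part = train[:2000], train[2000:]  # 2/3 fit, 1/3 threshold tuning
```

Because the generating model is itself linear-logistic, the logistic-loss risk minimizer lies in the linear class, which is what makes LR consistent in this experiment.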
The results are given in Fig. 2. As before, we plot the average \(\ell \)-regret for logistic and hinge loss, and the \(\varPsi \)-regret for the F-measure and the AM measure. The results obtained for LR (logistic loss minimizer) agree with our theoretical analysis: the \(\ell \)-regret and the \(\varPsi \)-regret with respect to both the F-measure and the AM measure go to zero. This is expected, as the data generating model is a linear logistic model (so that the risk minimizer for logistic loss is a linear function), and thus coincides with the class of functions over which we optimize. The situation is different for SVM (hinge loss minimizer). Firstly, the \(\ell \)-regret for hinge loss does not converge to zero. This is because the risk minimizer for hinge loss is the threshold function \({\mathrm {sgn}}(\eta (x) - 1/2)\), and it is not possible to approximate such a function with a linear model \(f(x) = w_0 + w^\top x\). Hence, even when \(n \rightarrow \infty \), the empirical hinge loss minimizer (SVM) does not converge to the risk minimizer. This behavior, however, can be advantageous for SVM in terms of the task performance measures. The risk minimizer for hinge loss, a threshold function on \(\eta (x)\) with threshold value 1/2, performs poorly, for example, in terms of the F-measure and the AM measure, for which the optimal threshold \(\theta ^*\) is usually very different from 1/2. In turn, the linear model constraint prevents convergence to the risk minimizer, and the resulting linear function \(f(x) = w_0 + w^\top x\) will often be close to some invertible function of \(\eta (x)\); hence, after tuning the threshold, we will often end up close to the minimizer of a given task performance measure. This is seen for the F-measure in the left panel of Fig. 2. In this case, the F-regret of SVM gets quite close to zero, but is still worse than that of LR.
The non-vanishing regret is mainly caused by the fact that for some models with imbalanced class priors, SVM reduces the weights w to zero and sets the intercept \(w_0\) to 1 or \(-1\), predicting the same value for all \(x \in X\) (this is not caused by a software problem; it is how the empirical loss minimizer behaves). Interestingly, the F-measure is only slightly affected by this pathological behavior of the empirical hinge loss minimizer. In turn, the AM measure, for which the plots are drawn in the right panel of Fig. 2, is not robust against this behavior of SVM: predicting the majority class results in an AM measure of exactly 1/2, a very poor performance, on the same level as a random classifier.
Benchmark data for binary classification
The next experiment is performed on two binary benchmark datasets,^{Footnote 5} described in Table 4. We randomly hold out a test set of size 181,012 for covtype, and of size 3000 for gisette. We use the remaining examples for training. As before, we incrementally increase the size of the training set. We use 2/3 of the training examples for learning a linear model with SVM or LR, and the rest for tuning the threshold. We repeat the experiment (random train/validation/test split) 20 times. The results are plotted in Fig. 3. Since the data distribution is unknown, we are unable to compute the risk minimizers; hence, we plot the average loss/metric on the test set rather than the regret. The results show that SVM performs better on the covtype dataset, while LR performs better on the gisette dataset. However, there is very little difference between the performance of SVM and LR in terms of the F-measure and the AM measure on these data sets. We suspect this is due to the fact that the function \(\eta (x)\) is very different from linear for these problems, so that neither LR nor SVM converges to the \(\ell \)-risk minimizer, and Theorem 1 does not apply. Further studies would be required to understand the behavior of surrogate losses in this case.
Benchmark data for multilabel classification
In the last experiment, we use three multilabel benchmark data sets.^{Footnote 6} Table 5 provides a summary of basic statistics of these datasets. The aim of the experiment is to verify the theoretical results from Sect. 5 on learning with micro- and macro-averaged performance metrics. We use the F-measure and the AM measure as in the previous experiments.
The data sets are already split into training and testing parts. As before, we train a linear model using either SVM or LR on 2/3 of the training examples. The rest of the training data is used for tuning the threshold. For optimizing macro-averaged measures, we tune the threshold separately for each label. This approach agrees with our analysis in Sect. 5.1. For micro-averaging, we tune a common threshold for all labels: we simply collect predictions for all labels and find the best threshold over these pooled values. This approach is justified by the theoretical analysis in Sect. 5.2. Hence, the only difference between the micro- and macro-versions of the algorithms is whether a single threshold or multiple thresholds are tuned. In total, we use 8 algorithm variants: two learning algorithms (LR/SVM), two performance measures (F/AM), and two types of averaging (macro/micro). Note that our experiments include evaluating algorithms tuned for macro-averaging in terms of micro-averaged metrics, and vice versa. The goal of such cross-analysis is to determine the impact of threshold sharing for both averaging schemes. As before, we incrementally increase the size of the training set, and repeat training and threshold tuning 20 times (we use random draws of training instances into the proper training and validation parts; the test set is always the same, as originally specified for each data set). The results are given in Fig. 4.
The plots generally agree with the conclusions of the theoretical analysis, with some intriguing exceptions, however. As expected, LR tuned for a given performance metric obtains the best result with respect to that metric in most cases. For the scene data set, however, the methods tuned for micro-averaged metrics (a single threshold shared among labels) outperform the ones tuned for macro-averaged metrics (separate thresholds tuned for each label), even when evaluated in terms of macro-averaged metrics. A similar result was obtained by Koyejo et al. (2015). It seems that tuning a single threshold shared among all labels can lead to a more stable solution that is less prone to overfitting, even though it is not the optimal thing to do for macro-averaged measures. We further note that, interestingly, SVM outperforms LR in terms of macro-F on mediamill, the only case in which SVM obtains a better result than LR.
Summary
We present a theoretical analysis of a two-step approach to optimizing classification performance metrics, which first learns a real-valued function f on a training sample by minimizing a surrogate loss, and then tunes the threshold on f by optimizing the target performance metric on a separate validation sample. We show that if the metric is a linear-fractional function, and the surrogate loss is strongly proper composite, then the regret of the resulting classifier (obtained from thresholding the real-valued f) measured with respect to the target metric is upper-bounded by the regret of f measured with respect to the surrogate loss. The proof of our result goes by an intermediate bound of the regret with respect to the target measure by a cost-sensitive classification regret. As a by-product, we get a bound on the cost-sensitive classification regret by a surrogate regret of a real-valued function which holds simultaneously for all misclassification costs. We also extend our results to cover multilabel classification and provide regret bounds for micro- and macro-averaging measures. Our findings are backed up by a computational study on both synthetic and real data sets.
Notes
Throughout the paper, we follow the convention that all conditional quantities are lowercase (regret, risk), while all unconditional quantities are uppercase (Regret, Risk).
The fact that a single threshold is sufficient for consistency of microaveraged performance measures was already noticed by Koyejo et al. (2015).
Software available at http://www.csie.ntu.edu.tw/~cjlin/liblinear.
Datasets are taken from LibSVM repository: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets.
Datasets are taken from LibSVM repository: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets.
References
Agarwal, S. (2014). Surrogate regret bounds for bipartite ranking via strongly proper losses. Journal of Machine Learning Research, 15, 1653–1674.
Bartlett, P. L., Jordan, M. I., & McAuliffe, J. D. (2006). Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473), 138–156.
Dembczyński, K., Cheng, W., & Hüllermeier, E. (2010). Bayes optimal multilabel classification via probabilistic classifier chains. In ICML 2010 (pp. 279–286). Omnipress.
Dembczyński, K., Waegeman, W., Cheng, W., & Hüllermeier, E. (2012). On loss minimization and label dependence in multilabel classification. Machine Learning, 88, 5–45.
Dembczyński, K., Jachnik, A., Kotłowski, W., Waegeman, W., & Hüllermeier, E. (2013). Optimizing the fmeasure in multilabel classification: Plugin rule approach versus structured loss minimization. In ICML.
Devroye, L., Györfi, L., & Lugosi, G. (1996). A probabilistic theory of pattern recognition (1st ed.). Berlin: Springer.
Fan, R. E., Chang, K. W., Hsieh, C. J., Wang, X. R., & Lin, C. J. (2008). LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9, 1871–1874.
Gao, W., & Zhou, Z. H. (2013). On the consistency of multilabel learning. Artificial Intelligence, 199–200, 22–44.
Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). Elements of statistical learning: Data mining, inference, and prediction. Berlin: Springer.
Jansche, M. (2005). Maximum expected Fmeasure training of logistic regression models. In HLT/EMNLP 2005 (pp. 736–743).
Jansche, M. (2007). A maximum expected utility framework for binary sequence labeling. In ACL 2007 (pp. 736–743).
Koyejo, O., Natarajan, N., Ravikumar, PK., & Dhillon, IS. (2014). Consistent binary classification with generalized performance metrics. In Neural information processing systems (NIPS).
Koyejo, O., Natarajan, N., Ravikumar, P., & Dhillon, IS. (2015). Consistent multilabel classification. In Neural information processing systems (NIPS).
Lewis, D. (1995). Evaluating and optimizing autonomous text classification systems. In SIGIR 1995 (pp. 246–254).
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. Cambridge: Cambridge University Press.
Menon, A. K., Narasimhan, H., Agarwal, S., & Chawla, S. (2013). On the statistical consistency of algorithms for binary classification under class imbalance. In International conference on machine learning (ICML).
Musicant, D. R., Kumar, V., & Ozgur, A. (2003). Optimizing f-measure with support vector machines. In FLAIRS conference (pp. 356–360).
Nan, Y., Chai, K. M. A., Lee, WS., & Chieu, H.L. (2012). Optimizing Fmeasure: A tale of two approaches. In International conference on machine learning (ICML).
Narasimhan, H., Vaish, R., & Agarwal, S. (2014). On the statistical consistency of plugin classifiers for nondecomposable performance measures. In Neural information processing systems (NIPS).
Narasimhan, H., Ramaswamym, H. G., Saha, A., & Agarwal, S. (2015). Consistent multiclass algorithms for complex performance measures. In International conference on machine learning (ICML).
Parambath, S. P., Usunier, N., & Grandvalet, Y. (2014). Optimizing Fmeasures by costsensitive classification. In Neural information processing systems (NIPS).
Petterson, J., & Caetano, T. S. (2010). Reverse multilabel learning. Advances in Neural Information Processing Systems, 24, 1912–1920.
Petterson, J., & Caetano, T. S. (2011). Submodular multilabel learning. Advances in Neural Information Processing Systems, 24, 1512–1520.
Reid, M. D., & Williamson, R. C. (2010). Composite binary losses. Journal of Machine Learning Research, 11, 2387–2422.
Reid, M. D., & Williamson, R. C. (2011). Information, divergence and risk for binary experiments. Journal of Machine Learning Research, 12, 731–817.
Tsochantaridis, I., Joachims, T., Hofmann, T., & Altun, Y. (2005). Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6, 1453–1484.
Waegeman, W., Dembczyński, K., Jachnik, A., Cheng, W., & Hüllermeier, E. (2013). On the Bayesoptimality of Fmeasure maximizers. Journal of Machine Learning Research, 15, 3513–3568.
Zhao, M. J., Edakunni, N., Pocock, A., & Brown, G. (2013). Beyond Fano’s inequality: Bounds on the optimal Fscore, BER, and costsensitive risk and their implications. Journal of Machine Learning Research, 14, 1033–1090.
Acknowledgments
W. Kotłowski has been supported by the Polish National Science Centre under Grant No. 2013/11/D/ST6/03050. K. Dembczyński has been supported by the Polish National Science Centre under Grant No. 2013/09/D/ST6/03917.
Additional information
Editors: Geoff Holmes, TieYan Liu, Hang Li, Irwin King and ZhiHua Zhou.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Kotłowski, W., Dembczyński, K. Surrogate regret bounds for generalized classification performance metrics. Mach Learn 106, 549–572 (2017). https://doi.org/10.1007/s1099401655917
Keywords
Generalized performance metric
Regret bound
Surrogate loss function
Binary classification
Multilabel classification
F-measure
Jaccard similarity
AM measure