Optimal Exponential Bounds on the Accuracy of Classification

Abstract

Consider a standard binary classification problem, in which \((X,Y)\) is a random couple in \(\mathcal{X}\times \{0,1\}\), and the training data consist of \(n\) i.i.d. copies of \((X,Y).\) Given a binary classifier \(f:\mathcal{X}\mapsto \{0,1\},\) the generalization error of \(f\) is defined by \(R(f)={\mathbb P}\{Y\ne f(X)\}\). Its minimum \(R^*\) over all binary classifiers \(f\) is called the Bayes risk and is attained at a Bayes classifier. The performance of any binary classifier \(\hat{f}_n\) based on the training data is characterized by the excess risk \(R(\hat{f}_n)-R^*\). We study Bahadur-type exponential bounds on the following minimax accuracy confidence function based on the excess risk:

$$\begin{aligned} \inf _{\hat{f}_n}\,\sup _{P\in \mathcal{M}}{\mathbb P}\bigl \{R(\hat{f}_n)-R^*\ge \lambda \bigr \},\qquad \lambda >0, \end{aligned}$$
where the supremum is taken over all distributions \(P\) of \((X,Y)\) from a given class of distributions \(\mathcal{M}\) and the infimum is over all binary classifiers \(\hat{f}_n\) based on the training data. We study how this quantity depends on the complexity of the class of distributions \(\mathcal{M}\) characterized by exponents of entropies of the class of regression functions or of the class of Bayes classifiers corresponding to the distributions from \(\mathcal{M}.\) We also study its dependence on margin parameters of the classification problem. In particular, we show that, in the case when \(\mathcal{X}=[0,1]^d\) and \(\mathcal{M}\) is the class of all distributions satisfying the margin condition with exponent \(\alpha >0\) and such that the regression function \(\eta \) belongs to a given Hölder class of smoothness \(\beta >0,\)

for some constants \(D,\lambda _0>0\).

References

  1. Audibert, J.-Y., Tsybakov, A.B.: Fast learning rates for plug-in classifiers. Ann. Stat. 35, 608–633 (2007)

  2. Blanchard, G., Lugosi, G., Vayatis, N.: On the rate of convergence of regularized boosting classifiers. J. Mach. Learn. Res. 4, 861–894 (2003)

  3. Bartlett, P.L., Jordan, M.I., McAuliffe, J.D.: Convexity, classification and risk bounds. J. Am. Stat. Assoc. 101, 138–156 (2006)

  4. Devroye, L., Györfi, L., Lugosi, G.: A Probabilistic Theory of Pattern Recognition. Applications of Mathematics (New York), vol. 31. Springer-Verlag, New York (1996)

  5. DeVore, R., Kerkyacharian, G., Picard, D., Temlyakov, V.: Approximation methods for supervised learning. Found. Comput. Math. 6, 3–58 (2006)

  6. Dudley, R.: Uniform Central Limit Theorems. Cambridge University Press, Cambridge (1999)

  7. Ibragimov, I.A., Hasminskii, R.Z.: Statistical Estimation: Asymptotic Theory. Springer, New York (1981)

  8. Kolmogorov, A.N., Tikhomirov, V.M.: \(\epsilon \)-entropy and \(\epsilon \)-capacity of sets in function spaces. Am. Math. Soc. Transl. Ser. 2 17, 277–364 (1961)

  9. Koltchinskii, V.: Local Rademacher complexities and Oracle inequalities in risk minimization. Ann. Stat. 34(6), 2593–2656 (2006)

  10. Koltchinskii, V.: Oracle inequalities in empirical risk minimization and sparse recovery problems. Ecole d’été de Probabilités de Saint-Flour 2008. Lecture Notes in Mathematics. Springer, New York (2011)

  11. Massart, P.: Concentration inequalities and model selection. Ecole d’été de Probabilités de Saint-Flour. Lecture Notes in Mathematics. Springer, New York (2007)

  12. Massart, P., Nédélec, É.: Risk bounds for statistical learning. Ann. Stat. 34(5), 2326–2366 (2006)

  13. Pentacaput, N.I.: Optimal exponential bounds on the accuracy of classification. arXiv:1111.6160 (2011)

  14. Steinwart, I., Scovel, J.C.: Fast rates for support vector machines using Gaussian kernels. Ann. Stat. 35, 575–607 (2007)

  15. Temlyakov, V.N.: Approximation in learning theory. Constr. Approx. 27(1), 33–74 (2008)

  16. Tsybakov, A.B.: Optimal aggregation of classifiers in statistical learning. Ann. Stat. 32(1), 135–166 (2004)

  17. Tsybakov, A.B.: Introduction to Nonparametric Estimation. Springer, New York (2009)

  18. Tsybakov, A.B., van de Geer, S.: Square root penalty: adaptation to the margin in classification and in edge estimation. Ann. Stat. 33(3), 1203–1224 (2005)

  19. van der Vaart, A., Wellner, J.: Weak Convergence and Empirical Processes, With Applications to Statistics. Springer-Verlag, New York (1996)

  20. Yang, Y.: Minimax nonparametric classification—part I: rates of convergence. IEEE Trans. Inform. Theory 45, 2271–2284 (1999)

Corresponding author

Correspondence to A. B. Tsybakov.

Additional information

Communicated by Albert Cohen.

Appendix

Proof of Theorems 2 and 3. We deduce Theorems 2 and 3 from the following fact that we state here as a proposition.

Proposition 4

Let either \(0<\alpha <\infty \) and \(\varkappa =\frac{1+\alpha }{\alpha }\) or \(\alpha =\infty \) and \(\varkappa =1\). Then, there exists a constant \(C_*>0\) such that, for all \(t>0,\)

$$\begin{aligned} \sup _{P\in {\mathcal {M}}^*(\rho , \alpha )}{\mathbb P}\biggl \{ R(\hat{f}_{n,2})-R(f^*_P)\ge C_* \biggl [n^{-\frac{\varkappa }{2\varkappa -1+\rho }}\vee \biggl (\frac{t}{n}\biggr )^{\frac{\varkappa }{2\varkappa -1}} \biggr ] \biggr \}\le e^{1-t}. \end{aligned}$$

It is easy to see that Theorem 2 follows from this proposition by taking \(t=c n \lambda ^{\frac{2+\alpha }{1+\alpha }}\) with \(\lambda \ge c' n^{-\frac{1+\alpha }{2+\alpha (1+\rho )}}\) for some constants \(c,c'>0,\) and using that \(\varkappa =\frac{1+\alpha }{\alpha }\). To obtain Theorem 3, we take \(t=c n \lambda \) with \(\lambda \ge c' n^{-\frac{1}{1+\rho }}\).
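Indeed, with \(\varkappa =\frac{1+\alpha }{\alpha }\), a direct computation of the exponents gives

$$\begin{aligned} \frac{\varkappa }{2\varkappa -1}=\frac{1+\alpha }{2+\alpha }\quad \text {and}\quad \frac{\varkappa }{2\varkappa -1+\rho }=\frac{1+\alpha }{2+\alpha (1+\rho )}, \end{aligned}$$

so that, for \(t=c n \lambda ^{\frac{2+\alpha }{1+\alpha }}\),

$$\begin{aligned} \biggl (\frac{t}{n}\biggr )^{\frac{\varkappa }{2\varkappa -1}}=c^{\frac{1+\alpha }{2+\alpha }}\lambda \quad \text {and}\quad n^{-\frac{\varkappa }{2\varkappa -1+\rho }}\le \frac{\lambda }{c'}\quad \text {whenever}\quad \lambda \ge c' n^{-\frac{1+\alpha }{2+\alpha (1+\rho )}}. \end{aligned}$$

Hence the threshold inside the probability in Proposition 4 is bounded by a constant multiple of \(\lambda \), while \(e^{1-t}=e\, e^{-cn\lambda ^{(2+\alpha )/(1+\alpha )}}\), which is the required exponential form.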

Proposition 4 will be derived from a general excess risk bound in abstract empirical risk minimization ([10], Theorem 4.3). We will state this result here for completeness. To this end, we need to introduce some notation. Let \(\mathcal{G}\) be a class of measurable functions from a probability space \((S,\mathcal{A}_S, P)\) into \([0,1]\), and let \(Z_1,\dots , Z_n\) be i.i.d. copies of an observation \(Z\) sampled from \(P.\) For any probability measure \(P\) and any \(g\in \mathcal{G}\), introduce the following notation for the expectation:

$$\begin{aligned} Pg=\int \limits _S gdP. \end{aligned}$$

Denote by \(P_n\) the empirical measure based on \((Z_1,\dots , Z_n)\), and consider the minimizer of the empirical risk

$$\begin{aligned} \hat{g}_n \triangleq \mathrm{argmin}_{g\in \mathcal{G}}P_n g. \end{aligned}$$

For a function \(g\in \mathcal{G},\) define the excess risk

$$\begin{aligned} \mathcal{E}_P(g)\triangleq Pg-\inf \limits _{g'\in \mathcal{G}}Pg'. \end{aligned}$$

The set

$$\begin{aligned} \mathcal{F}_P(\delta )\triangleq \{g\in \mathcal{G}: \, \mathcal{E}_P(g)\le \delta \} \end{aligned}$$

is called the \(\delta \)-minimal set. The size of such a set will be controlled in terms of its \(L_2(P)\)-diameter

$$\begin{aligned} D(\delta )\triangleq \sup _{g,g'\in \mathcal{F}_P(\delta )}\Vert g-g'\Vert _{L_2(P)} \end{aligned}$$

and also in terms of the following “localized empirical complexity”:

$$\begin{aligned} \phi _n(\delta )\triangleq {\mathbb E}\sup _{g,g'\in \mathcal{F}_P(\delta )}|(P_n-P)(g-g')|. \end{aligned}$$

We will use these complexity measures to construct an upper confidence bound on the excess risk \(\mathcal{E}_P(\hat{f}_{n,2}).\) For a function \(\psi :{\mathbb R}_+\mapsto {\mathbb R}_+,\) define

$$\begin{aligned} \psi ^{\flat }(\delta )\triangleq \sup _{\sigma \ge \delta }\frac{\psi (\sigma )}{\sigma }. \end{aligned}$$
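For instance, if \(\psi (\sigma )\le c\sigma ^{\gamma }\) for all \(\sigma >0\) with some \(c>0\) and \(0<\gamma \le 1\), then \(\psi ^{\flat }(\delta )\le c\delta ^{\gamma -1}\), since the function \(\sigma \mapsto c\sigma ^{\gamma -1}\) is nonincreasing; bounds of this form (or maxima of such bounds) are all that will be needed below.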

Let

$$\begin{aligned} V_n^t(\delta )\triangleq 4\biggl [\phi _n^{\flat }(\delta )+\sqrt{(D^2)^{\flat }(\delta )\frac{t}{n\delta }}+ \frac{t}{n\delta }\biggr ],\ \delta >0, t>0, \end{aligned}$$

and define

$$\begin{aligned} \sigma _n^t\triangleq \inf \{\sigma : V_n^t(\sigma )\le 1\}. \end{aligned}$$
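As an illustration only (this is not part of the argument), \(\sigma _n^t\) can be approximated numerically once upper bounds on \(\phi _n^{\flat }\) and \((D^2)^{\flat }\) are available: it suffices to scan a grid for the smallest \(\sigma \) with \(V_n^t(\sigma )\le 1\). The sketch below does this in Python; the function names and the placeholder bounds at the end are arbitrary choices made for this example.

```python
import numpy as np

def sigma_n_t(phi_flat, D2_flat, n, t, grid=None):
    """Grid approximation of sigma_n^t = inf{sigma > 0 : V_n^t(sigma) <= 1},
    where V_n^t(sigma) = 4*[phi_flat(sigma) + sqrt(D2_flat(sigma)*t/(n*sigma)) + t/(n*sigma)].
    phi_flat and D2_flat should be (vectorized) nonincreasing upper bounds on
    phi_n^flat and (D^2)^flat, as produced by the flat transform."""
    if grid is None:
        grid = np.logspace(-8, 0, 2000)  # candidate values of sigma in (0, 1]
    V = 4.0 * (phi_flat(grid)
               + np.sqrt(D2_flat(grid) * t / (n * grid))
               + t / (n * grid))
    feasible = grid[V <= 1.0]
    return float(feasible.min()) if feasible.size else float("inf")

# Toy illustration using the polynomial bounds derived later in the proof;
# kappa, rho and the constants are placeholders chosen only for this example.
kappa, rho, n, t = 2.0, 0.5, 10000, 5.0
phi_flat = lambda s: s ** ((1 - rho) / (2 * kappa) - 1) / np.sqrt(n)
D2_flat = lambda s: s ** (1.0 / kappa - 1)
print(sigma_n_t(phi_flat, D2_flat, n, t))
```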

The following result is the first bound of Theorem 4.3 in [10].

Proposition 5

For all \(t>0,\)

$$\begin{aligned} {\mathbb P}\Big \{\mathcal{E}_P(\hat{f}_{n,2})>\sigma _n^t\Big \}\le e^{1-t}. \end{aligned}$$

In addition, we will use the well-known bound on the expected sup-norm of the empirical process in terms of bracketing entropy; see Theorem 2.14.2 in [19]. More precisely, we will need the following simplified version of that result.

Lemma 2

Let \(\mathcal{T}\) be a class of functions from \(S\) into \([0,1]\) such that \(\Vert g\Vert _{L_2(P)}\le a\) for all \(g\in \mathcal{T}.\) Assume that \(H_{[\ ]}(a, \mathcal{T},\Vert \cdot \Vert _{L_2(P)})+1\le a^2 n\). Then,

$$\begin{aligned} {\mathbb E}\sup _{g\in \mathcal{T}}|P_n g -P g|\le \frac{\bar{C}}{\sqrt{n}} \int \limits _0^{a} \left( H_{[\ ]}({\varepsilon }, \mathcal{T},\Vert \cdot \Vert _{L_2(P)})+1\right) ^{1/2} d{\varepsilon }, \end{aligned}$$

where \(\bar{C}>0\) is a universal constant.

Proof of Proposition 4

Note that if \(t>n,\) then \((\frac{t}{n})^{\varkappa /(2\varkappa -1)}> 1,\) and the result holds trivially with \(C_*=1\) since \(R(\hat{f}_{n,2})-R(f^*_P)\le 1.\) Thus, it is enough to consider the case \(t\le n.\)

Let \(S=\mathcal{X}\times \{0,1\}\) and \(P\) be the distribution of \(Z=(X,Y)\). We will apply Proposition 5 to the class \(\mathcal{G}\triangleq \{g_f: \, g_f(x,y)=I_{\{y \ne f(x)\}}, \ f\in \mathcal{F}\}\). Then, clearly, \(Pg_f=R(f)\) and \(\mathcal{E}_P (g_f)= R(f)-R(f^*_P)\) for \(g_f(x,y)=I_{\{y \ne f(x)\}},\) which implies that

$$\begin{aligned} \mathcal{F}_P(\delta )=\{g_f: f\in \mathcal{F},\ R(f)-R(f^*_P)\le \delta \}. \end{aligned}$$

We also have \(\Vert g_{f_1}-g_{f_2}\Vert _{L_2(P)}^2=\Vert f_1-f_2\Vert _{L_1(\mu _X)}.\) Thus, it follows from Lemma 1 that, for all \(g_f\in \mathcal{G}\),

$$\begin{aligned} \mathcal{E}_P (g_f)\ge c_M \Vert g_f-g_{f_{P}^{*}}\Vert _{L_2(P)}^{2\varkappa }, \end{aligned}$$

and we get a bound on the \(L_2(P)\)-diameter of the \(\delta \)-minimal set \(\mathcal{F}_P(\delta ):\) with some constant \({\bar{c}}_1>0\),

$$\begin{aligned}&D(\delta )\le {\bar{c}}_1 \delta ^{1/(2\varkappa )}. \end{aligned}$$
(22)
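For completeness, (22) follows by the triangle inequality: every \(g\in \mathcal{F}_P(\delta )\) satisfies \(c_M\Vert g-g_{f_{P}^{*}}\Vert _{L_2(P)}^{2\varkappa }\le \mathcal{E}_P(g)\le \delta \), so that, for all \(g,g'\in \mathcal{F}_P(\delta )\),

$$\begin{aligned} \Vert g-g'\Vert _{L_2(P)}\le \Vert g-g_{f_{P}^{*}}\Vert _{L_2(P)}+\Vert g'-g_{f_{P}^{*}}\Vert _{L_2(P)}\le 2(\delta /c_M)^{1/(2\varkappa )}, \end{aligned}$$

and one may take \({\bar{c}}_1=2c_M^{-1/(2\varkappa )}\).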

To bound the function \(\phi _n(\delta ),\) we will apply Lemma 2 to the class \(\mathcal{T}=\mathcal{F}_P(\delta )\), centered at \(g_{f_{P}^{*}}\), with \(a={\bar{c}}_1\delta ^{1/(2\varkappa )}\) (cf. the diameter bound (22)). Note that

$$\begin{aligned} H_{[\ ]}({\varepsilon }, \mathcal{F}_P(\delta ),\Vert \cdot \Vert _{L_2(P)})&\le 2H_{[\ ]}({\varepsilon }/2,\mathcal{G},\Vert \cdot \Vert _{L_2(P)})\\&\le 2 H_{[\ ]}({\varepsilon }^2/4, \mathcal{F},\Vert \cdot \Vert _{L_1(\mu _X)})\\&\le 2 H_{[\ ]}({\varepsilon }^2/(4c_\mu ), \mathcal{F},\Vert \cdot \Vert _{L_1(\mu )}) . \end{aligned}$$

Using (17), we easily get from Lemma 2 that, with some constants \({\bar{c}}_2, {\bar{c}}_3>0\),

$$\begin{aligned} \phi _n(\delta )\le {\bar{c}}_2 \delta ^{\frac{1-\rho }{2\varkappa }} n^{-1/2},\ \ \delta \ge {\bar{c}}_3 n^{-\frac{\varkappa }{1+\rho }}, \end{aligned}$$

which implies that, with some constant \({\bar{c}}_4>0\),

$$\begin{aligned} \phi _n(\delta )\le {\bar{c}}_4 \max \Big (\delta ^{\frac{1-\rho }{2\varkappa }} n^{-1/2}, n^{-\frac{1}{1+\rho }}\Big ),\delta >0. \end{aligned}$$
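The restricted bound \(\phi _n(\delta )\le {\bar{c}}_2 \delta ^{\frac{1-\rho }{2\varkappa }} n^{-1/2}\), \(\delta \ge {\bar{c}}_3 n^{-\frac{\varkappa }{1+\rho }}\), can be traced as follows (a sketch, under the assumption that (17) is a bracketing entropy bound of the form \(H_{[\ ]}({\varepsilon }, \mathcal{F},\Vert \cdot \Vert _{L_1(\mu )})\le A{\varepsilon }^{-\rho }\) for some constant \(A>0\)). The entropy chain above then gives \(H_{[\ ]}({\varepsilon }, \mathcal{F}_P(\delta ),\Vert \cdot \Vert _{L_2(P)})\le A'{\varepsilon }^{-2\rho }\), and Lemma 2, applied with \(a={\bar{c}}_1\delta ^{1/(2\varkappa )}\) to the class centered at \(g_{f_{P}^{*}}\) (centering does not change the bracketing numbers), yields

$$\begin{aligned} \phi _n(\delta )\le \frac{2\bar{C}}{\sqrt{n}} \int \limits _0^{a}\bigl (A'{\varepsilon }^{-2\rho }+1\bigr )^{1/2}d{\varepsilon }\le \frac{C'}{\sqrt{n}}\,a^{1-\rho }\le \frac{C'{\bar{c}}_1^{1-\rho }}{\sqrt{n}}\,\delta ^{\frac{1-\rho }{2\varkappa }}, \end{aligned}$$

using \(\rho <1\). The condition \(H_{[\ ]}(a, \mathcal{T},\Vert \cdot \Vert _{L_2(P)})+1\le a^2 n\) of Lemma 2 translates, up to constants, into \(a^{-2\rho }\lesssim a^2 n\), that is, \(\delta \gtrsim n^{-\frac{\varkappa }{1+\rho }}\), which is the restriction above.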

The resulting bound on \(\phi _n(\delta )\) and (22) lead to the following bound on the function \(V_n^t(\delta )\):

$$\begin{aligned} V_n^t(\delta )\le {\bar{c}}_5 \biggl [\delta ^{\frac{1-\rho }{2\varkappa }-1}n^{-1/2}\vee \delta ^{-1}n^{-\frac{1}{1+\rho }}+\delta ^{\frac{1}{2\varkappa }-1} \sqrt{\frac{t}{n}}+ \delta ^{-1}\frac{t}{n}\biggr ] \end{aligned}$$

that holds with some constant \({\bar{c}}_5.\) Thus, we end up with a bound on \(\sigma _n^t:\)

$$\begin{aligned} \sigma _n^{t}\le {\bar{c}}_6 \biggl [n^{-\frac{\varkappa }{2\varkappa -1+\rho }}\vee n^{-\frac{1}{1+\rho }}\vee \biggl (\frac{t}{n}\biggr )^{\varkappa /(2\varkappa -1)} \vee \frac{t}{n} \biggr ]. \end{aligned}$$
(23)
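To see how (23) is obtained, note that \(\sigma _n^t\le \delta \) as soon as each of the four terms inside the brackets above is at most \(1/(3{\bar{c}}_5)\) (the bracket is then at most \(1/{\bar{c}}_5\), so \(V_n^t(\delta )\le 1\)). Solving the four corresponding inequalities for \(\delta \) gives, with constants \(C\) depending only on \({\bar{c}}_5,\varkappa ,\rho \),

$$\begin{aligned} \delta \ge C n^{-\frac{\varkappa }{2\varkappa -1+\rho }},\qquad \delta \ge C n^{-\frac{1}{1+\rho }},\qquad \delta \ge C\biggl (\frac{t}{n}\biggr )^{\frac{\varkappa }{2\varkappa -1}},\qquad \delta \ge C\frac{t}{n}, \end{aligned}$$

respectively, and taking \(\delta \) equal to a suitable constant times the maximum of these four thresholds gives (23).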

Note that, for \(\varkappa \ge 1,\;\rho < 1\), and \(t\le n,\) we have

$$\begin{aligned} n^{-\varkappa /(2\varkappa -1+\rho )}\ge n^{-1/(1+\rho )}\ \ \mathrm{and}\ \ \left( \frac{t}{n}\right) ^{\varkappa /(2\varkappa -1)} \ge \frac{t}{n}. \end{aligned}$$
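Both comparisons are elementary: the first one is equivalent to

$$\begin{aligned} \varkappa (1+\rho )\le 2\varkappa -1+\rho ,\quad \text {i.e.,}\quad (\varkappa -1)(1-\rho )\ge 0, \end{aligned}$$

which holds since \(\varkappa \ge 1\) and \(\rho <1\); the second one holds because \(t/n\le 1\) and \(\varkappa /(2\varkappa -1)\le 1\).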

Therefore, (23) can be simplified as follows:

$$\begin{aligned} \sigma _n^{t}\le {\bar{c}}_7 \biggl [n^{-\frac{\varkappa }{2\varkappa -1+\rho }}+ \biggl (\frac{t}{n}\biggr )^{\varkappa /(2\varkappa -1)} \biggr ], \end{aligned}$$

and the result immediately follows from Proposition 5.\(\square \)

1.1 Tools for the Minimax Lower Bounds

For two probability measures \(\mu \) and \(\nu \) on a measurable space \(({\mathcal {X}}, {\mathcal A})\), we define the Kullback-Leibler divergence and the \(\chi ^2\)-divergence as follows:

$$\begin{aligned} \mathcal {K}(\mu ,\nu ) \triangleq \int \limits _{{\mathcal {X}}} g\ln g d\nu , \quad \chi ^2(\mu ,\nu ) \triangleq \int \limits _{{\mathcal {X}}} (g-1)^2 d\nu , \end{aligned}$$

if \(\mu \) is absolutely continuous with respect to \(\nu \) with Radon-Nikodym derivative \(g=\frac{d\mu }{d\nu },\) and we set \(\mathcal {K}(\mu ,\nu )\triangleq +\infty \), \(\chi ^2(\mu ,\nu )\triangleq +\infty \) otherwise.
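As a side illustration (not needed for the proofs), both divergences are straightforward to evaluate for discrete measures. The following minimal sketch, with ad hoc function names, also checks numerically the inequality \(\mathcal {K}(\mu ,\nu )\le \chi ^2(\mu ,\nu )\) invoked at the end of the appendix.

```python
import numpy as np

def kl_divergence(mu, nu):
    """K(mu, nu) = sum_i mu_i * log(mu_i / nu_i), i.e. the integral of g*log(g) dnu
    with g = dmu/dnu; returns +inf if mu is not absolutely continuous w.r.t. nu."""
    mu, nu = np.asarray(mu, dtype=float), np.asarray(nu, dtype=float)
    if np.any((nu == 0) & (mu > 0)):
        return float("inf")
    mask = mu > 0
    return float(np.sum(mu[mask] * np.log(mu[mask] / nu[mask])))

def chi2_divergence(mu, nu):
    """chi^2(mu, nu) = sum_i (mu_i - nu_i)^2 / nu_i; returns +inf if mu is not
    absolutely continuous w.r.t. nu."""
    mu, nu = np.asarray(mu, dtype=float), np.asarray(nu, dtype=float)
    if np.any((nu == 0) & (mu > 0)):
        return float("inf")
    mask = nu > 0
    return float(np.sum((mu[mask] - nu[mask]) ** 2 / nu[mask]))

# Numerical check of K(mu, nu) <= chi^2(mu, nu) on a toy example:
mu, nu = [0.3, 0.7], [0.5, 0.5]
assert kl_divergence(mu, nu) <= chi2_divergence(mu, nu)
```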

We will use the following auxiliary result.

Lemma 3

Let \(({\mathcal {X}}, {\mathcal A})\) be a measurable space, and let \(A_i \in {\mathcal A},\; i\in \{ 0,1,\dots ,M\},\; M\ge 2\), be such that \(A_i\cap A_j =\emptyset \) for all \(i\ne j.\) Assume that \(Q_i\), \(i\in \{0,1,\dots ,M\}\), are probability measures on \(({\mathcal {X}}, {\mathcal A})\) such that

$$\begin{aligned} \frac{1}{M}\sum _{j=1}^M \mathcal {K}(Q_j,Q_0) \le \chi <\infty . \end{aligned}$$

Then,

$$\begin{aligned} p_*\triangleq \max _{0\le i \le M} Q_{i}({\mathcal {X}}{\setminus } A_i) \ge \frac{1}{12}\min \{1, \, M e^{-3\chi }\}. \end{aligned}$$

Proof

Proposition 2.3 in [17] yields

$$\begin{aligned} p_*\ge \sup _{0<\tau <1}\frac{\tau M}{\tau M +1} \left( 1+\frac{\chi + \sqrt{\chi /2}}{\log \tau }\right) . \end{aligned}$$

In particular, taking \(\tau ^*=\min (M^{-1}, e^{-3\chi })\) and using that \(\sqrt{6\log M} \ge 2\) for \(M\ge 2\), we obtain

$$\begin{aligned} p_*\ge \frac{\tau ^* M}{\tau ^* M +1}\left( 1+\frac{\chi + \sqrt{\chi /2}}{\log \tau ^*}\right) \ge \frac{1}{12}\min \{1, \, M e^{-3\chi }\}. \end{aligned}$$

\(\square \)

We now prove an analog, for the classification setting, of the lower bound obtained by DeVore et al. [5] for the regression problem.

Theorem 5

Assume that a class \(\Theta \) of probability distributions \(P\) with the corresponding regression functions \(\eta _P\) and Bayes rules \(f^*_{P}\) (as defined above) contains a set \(\{{P_i}\}_{i=1}^N \subset \Theta ,\; N\ge 3\), with the following properties: the marginal distribution of \(X\) under every \(P_i\) is the same (arbitrary) probability measure \(\mu _X\), independent of \(i\); \(1/4\le \eta _{P_i}\le 3/4\) for \(i=1,\dots ,N\); and, for any \(i\ne j\),

$$\begin{aligned} \Vert \eta _{P_i}-\eta _{P_j}\Vert _{L_2(\mu _X)}\le \gamma , \end{aligned}$$
(24)
$$\begin{aligned} \Vert f^*_{P_i}-f^*_{P_j}\Vert _{L_1(\mu _X)}\ge s \end{aligned}$$
(25)

with some \(\gamma >0,\; s>0\). Then, for any classifier \(\hat{f}_n\), we have

$$\begin{aligned} \max _{1\le k \le N}\mathbb {P}_k\{\Vert \hat{f}_n-f^*_{P_k}\Vert _{L_1(\mu _X)}\ge s/{2}\} \ge \frac{1}{12}\min \big (1, \, (N-1) \exp \{-24 n\gamma ^2\}\big ), \end{aligned}$$

where \(\mathbb {P}_k\) denotes the product probability measure associated to the i.i.d. \(n\)-sample from \(P_k\).

Proof

We apply Lemma 3, where we set \(Q_i=\mathbb {P}_i\), \(M=N-1\), and define the random events \(A_i\) as follows:

$$\begin{aligned} A_i\triangleq \{{\mathcal {D}}_n:\Vert \hat{f}_n-f^*_{P_i}\Vert _{L_1(\mu _X)}<s/2\},\quad i=1,\dots , N. \end{aligned}$$

The events \(A_i\) are disjoint because of (25): if the training sample \({\mathcal {D}}_n\) belonged to \(A_i\cap A_j\) for some \(i\ne j\), the triangle inequality would give \(\Vert f^*_{P_i}-f^*_{P_j}\Vert _{L_1(\mu _X)}<s\), contradicting (25). Thus, the theorem follows from Lemma 3 if we prove that \( \mathcal {K}(\mathbb {P}_i,\mathbb {P}_j)\le 8n\gamma ^2\) for all \(i,j\).

Let us evaluate \( \mathcal {K}(\mathbb {P}_i,\mathbb {P}_j)\). For each \(\eta _{P_i}\), the corresponding measure \(P_i\) is determined as follows:

$$\begin{aligned} dP_i(x,y)\triangleq (\eta _{P_i}(x)d\delta _{1}(y)+ (1-\eta _{P_i}(x))d\delta _{0}(y))d\mu _X(x), \end{aligned}$$

where \(d\delta _\xi \) denotes the Dirac measure with unit mass at \(\xi \). Set for brevity \(\eta _i\triangleq \eta _{P_i}\). Fix \(i\) and \(j\). We have \(dP_i(x,y)= g(x,y)dP_j(x,y)\), where

$$\begin{aligned} g(x,1)= \frac{\eta _i(x)}{\eta _j(x)},\quad g(x,0)=\frac{1-\eta _i(x)}{1-\eta _j(x)}. \end{aligned}$$

Therefore, using the inequalities \(1/4\le \eta _{i}, \eta _j\le 3/4\) and (24), we find

$$\begin{aligned} \chi ^2({P}_i, {P}_j)&= \int \left\{ \frac{(\eta _i(x)-\eta _j(x))^2}{\eta _j(x)}+ \frac{(\eta _i(x)-\eta _j(x))^2}{1-\eta _j(x)}\right\} d\mu _X(x)\nonumber \\&\le 8\Vert \eta _i-\eta _j\Vert _{L_2(\mu _X)}^2 \le 8\gamma ^2. \end{aligned}$$
(26)
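The constant 8 in (26) comes from the elementary pointwise bound

$$\begin{aligned} \frac{1}{\eta _j(x)}+\frac{1}{1-\eta _j(x)}\le 4+4=8, \end{aligned}$$

which holds since \(1/4\le \eta _j\le 3/4\).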

Together with the inequality between the Kullback-Leibler and \(\chi ^2\)-divergences (cf. [17], p. 90), this yields

$$\begin{aligned} \mathcal {K}(\mathbb {P}_i,\mathbb {P}_j) = n\mathcal {K}({P}_i, {P}_j) \le n\chi ^2({P}_i,{P}_j) \le 8n\gamma ^2. \end{aligned}$$

\(\square \)

Comment. The preprint version of this paper was posted on arXiv under the pseudonym N.I. Pentacaput [13]. The paper was then submitted to “Constructive Approximation” and accepted for publication under this pseudonym. However, it turned out that, because of the publisher’s rules, no paper can be published under a pseudonym. As a result, we publish it under our real names, which we have chosen to arrange in a random order.


About this article

Cite this article

Kerkyacharian, G., Tsybakov, A.B., Temlyakov, V. et al. Optimal Exponential Bounds on the Accuracy of Classification. Constr Approx 39, 421–444 (2014). https://doi.org/10.1007/s00365-014-9229-3
