Abstract
Consider a standard binary classification problem, in which \((X,Y)\) is a random couple in \(\mathcal{X}\times \{0,1\}\), and the training data consist of \(n\) i.i.d. copies of \((X,Y).\) Given a binary classifier \(f:\mathcal{X}\mapsto \{0,1\},\) the generalization error of \(f\) is defined by \(R(f)={\mathbb P}\{Y\ne f(X)\}\). Its minimum \(R^*\) over all binary classifiers \(f\) is called the Bayes risk and is attained at a Bayes classifier. The performance of any binary classifier \(\hat{f}_n\) based on the training data is characterized by the excess risk \(R(\hat{f}_n)-R^*\). We study Bahadur-type exponential bounds on the following minimax accuracy confidence function based on the excess risk:
\[ \inf_{\hat f_n}\,\sup_{P\in \mathcal{M}}\,{\mathbb P}\big\{R(\hat f_n)-R^*\ge \lambda \big\},\qquad \lambda >0, \]
where the supremum is taken over all distributions \(P\) of \((X,Y)\) from a given class of distributions \(\mathcal{M}\) and the infimum is over all binary classifiers \(\hat{f}_n\) based on the training data. We study how this quantity depends on the complexity of the class of distributions \(\mathcal{M},\) characterized by the exponents of the entropies of the class of regression functions or of the class of Bayes classifiers corresponding to the distributions from \(\mathcal{M}.\) We also study its dependence on the margin parameters of the classification problem. In particular, we show that, in the case when \(\mathcal{X}=[0,1]^d\) and \(\mathcal{M}\) is the class of all distributions satisfying the margin condition with exponent \(\alpha >0\) and such that the regression function \(\eta \) belongs to a given Hölder class of smoothness \(\beta >0,\)
\[ \inf_{\hat f_n}\,\sup_{P\in \mathcal{M}}\,{\mathbb P}\big\{R(\hat f_n)-R^*\ge \lambda \big\}\ \le\ \exp \big(-D\,n\,\lambda ^{\frac{2+\alpha }{1+\alpha }}\big)\qquad \text{for all}\ \ \lambda \ge \lambda _0\, n^{-\frac{\beta (1+\alpha )}{\beta (2+\alpha )+d}}, \]
for some constants \(D,\lambda _0>0\).
References
Audibert, J.-Y., Tsybakov, A.B.: Fast learning rates for plug-in classifiers. Ann. Stat. 35, 608–633 (2007)
Blanchard, G., Lugosi, G., Vayatis, N.: On the rate of convergence of regularized boosting classifiers. J. Mach. Learn. Res. 4, 861–894 (2003)
Bartlett, P.L., Jordan, M.I., McAuliffe, J.D.: Convexity, classification and risk bounds. J. Am. Stat. Assoc. 101, 138–156 (2006)
Devroye, L., Györfi, L., Lugosi, G.: A Probabilistic Theory of Pattern Recognition. Applications of Mathematics (New York), vol. 31. Springer-Verlag, New York (1996)
DeVore, R., Kerkyacharian, G., Picard, D., Temlyakov, V.: Approximation methods for supervised learning. Found. Comput. Math. 6, 3–58 (2006)
Dudley, R.: Uniform Central Limit Theorems. Cambridge University Press, Cambridge (1999)
Ibragimov, I.A., Hasminskii, R.Z.: Statistical Estimation: Asymptotic Theory. Springer, New York (1981)
Kolmogorov, A.N., Tikhomirov, V.M.: \(\epsilon \)-entropy and \(\epsilon \)-capacity of sets in function spaces. Am. Math. Soc. Transl. Ser. 2 17, 277–364 (1961)
Koltchinskii, V.: Local Rademacher complexities and Oracle inequalities in risk minimization. Ann. Stat. 34(6), 2593–2656 (2006)
Koltchinskii, V.: Oracle inequalities in empirical risk minimization and sparse recovery problems. Ecole d’été de Probabilités de Saint-Flour 2008. Lecture Notes in Mathematics. Springer, New York (2011)
Massart, P.: Concentration inequalities and model selection. Ecole d’été de Probabilités de Saint-Flour. Lecture Notes in Mathematics. Springer, New York (2007)
Massart, P., Nédélec, É.: Risk bounds for statistical learning. Ann. Stat. 34(5), 2326–2366 (2006)
Pentacaput, N.I.: Optimal exponential bounds on the accuracy of classification. arXiv:1111.6160 (2011)
Steinwart, I., Scovel, J.C.: Fast rates for support vector machines using Gaussian kernels. Ann. Stat. 35, 575–607 (2007)
Temlyakov, V.N.: Approximation in learning theory. Constr. Approx. 27(1), 33–74 (2008)
Tsybakov, A.B.: Optimal aggregation of classifiers in statistical learning. Ann. Stat. 32(1), 135–166 (2004)
Tsybakov, A.B.: Introduction to Nonparametric Estimation. Springer, New York (2009)
Tsybakov, A.B., van de Geer, S.: Square root penalty: adaptation to the margin in classification and in edge estimation. Ann. Stat. 33(3), 1203–1224 (2005)
van der Vaart, A., Wellner, J.: Weak Convergence and Empirical Processes, With Applications to Statistics. Springer-Verlag, New York (1996)
Yang, Y.: Minimax nonparametric classification—part I: rates of convergence. IEEE Trans. Inform. Theory 45, 2271–2284 (1999)
Communicated by Albert Cohen.
Appendix
Proof of Theorems 2 and 3. We deduce Theorems 2 and 3 from the following fact that we state here as a proposition.
Proposition 4
Let either \(0<\alpha <\infty \) and \(\varkappa =\frac{1+\alpha }{\alpha }\) or \(\alpha =\infty \) and \(\varkappa =1\). Then, there exists a constant \(C_*>0\) such that, for all \(t>0,\)
\[ {\mathbb P}\Big\{R(\hat f_{n,2})-R(f^*_P)\ \ge\ C_*\Big(\Big(\frac{t}{n}\Big)^{\frac{\varkappa }{2\varkappa -1}}\vee n^{-\frac{\varkappa }{2\varkappa -1+\rho }}\Big)\Big\}\ \le\ e^{-t}. \]
It is easy to see that Theorem 2 follows from this proposition by taking \(t=c n \lambda ^{\frac{2+\alpha }{1+\alpha }}\) with \(\lambda \ge c' n^{-\frac{1+\alpha }{2+\alpha (1+\rho )}}\) for some constants \(c,c'>0,\) and using that \(\varkappa =\frac{1+\alpha }{\alpha }\). To obtain Theorem 3, we take \(t=c n \lambda \) with \(\lambda \ge c' n^{-\frac{1}{1+\rho }}\).
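For Theorem 2, the exponent arithmetic behind this substitution is a routine check:

```latex
% With \varkappa = (1+\alpha)/\alpha we have 2\varkappa - 1 = (2+\alpha)/\alpha,
% hence \varkappa/(2\varkappa-1) = (1+\alpha)/(2+\alpha).
% Therefore, for t = c\, n\, \lambda^{(2+\alpha)/(1+\alpha)},
\left(\frac{t}{n}\right)^{\frac{\varkappa}{2\varkappa-1}}
  = \left(c\,\lambda^{\frac{2+\alpha}{1+\alpha}}\right)^{\frac{1+\alpha}{2+\alpha}}
  = c^{\frac{1+\alpha}{2+\alpha}}\,\lambda ,
% so the deviation threshold is a constant multiple of \lambda.
```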
Proposition 4 will be derived from a general excess risk bound in abstract empirical risk minimization ([10], Theorem 4.3). We will state this result here for completeness. To this end, we need to introduce some notation. Let \(\mathcal{G}\) be a class of measurable functions from a probability space \((S,\mathcal{A}_S, P)\) into \([0,1]\), and let \(Z_1,\dots , Z_n\) be i.i.d. copies of an observation \(Z\) sampled from \(P.\) For any probability measure \(P\) and any \(g\in \mathcal{G}\), introduce the following notation for the expectation:
\[ Pg\triangleq \int _S g\, dP. \]
Denote by \(P_n\) the empirical measure based on \((Z_1,\dots , Z_n)\), and consider the minimizer of the empirical risk
\[ \hat g_n\triangleq \mathop{\mathrm{argmin}}_{g\in \mathcal{G}}P_n g. \]
For a function \(g\in \mathcal{G},\) define the excess risk
\[ \mathcal{E}_P(g)\triangleq Pg-\inf _{h\in \mathcal{G}}Ph. \]
The set
\[ \mathcal{G}_P(\delta )\triangleq \{g\in \mathcal{G}:\ \mathcal{E}_P(g)\le \delta \},\qquad \delta >0, \]
is called the \(\delta \)-minimal set. The size of such a set will be controlled in terms of its \(L_2(P)\)-diameter
\[ D_P(\delta )\triangleq \sup _{g,h\in \mathcal{G}_P(\delta )}\Vert g-h\Vert _{L_2(P)} \]
and also in terms of the following “localized empirical complexity”:
\[ \phi _n(\delta )\triangleq {\mathbb E}\sup _{g,h\in \mathcal{G}_P(\delta )}\big|(P_n-P)(g-h)\big|. \]
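To make these definitions concrete, here is a small self-contained sketch on a three-point sample space, assuming the standard definitions \(\mathcal{E}_P(g)=Pg-\inf _{h}Ph\), \(\mathcal{G}_P(\delta )=\{g:\mathcal{E}_P(g)\le \delta \}\), and \(D_P(\delta )=\sup _{g,h\in \mathcal{G}_P(\delta )}\Vert g-h\Vert _{L_2(P)}\); the class \(\mathcal{G}\) and the distribution are made up for illustration only:

```python
import itertools, math

# Toy sample space S = {0, 1, 2} with distribution P, and a small finite
# class G of [0,1]-valued functions on S (each encoded as a tuple of values).
p = [0.5, 0.3, 0.2]
G = [(0.0, 1.0, 1.0), (0.1, 0.8, 0.9), (1.0, 0.0, 0.2), (0.4, 0.4, 0.4)]

def P(g):                      # Pg = integral of g with respect to P
    return sum(ps * gs for ps, gs in zip(p, g))

risk_min = min(P(g) for g in G)

def excess(g):                 # E_P(g) = Pg - inf_h Ph
    return P(g) - risk_min

def minimal_set(delta):        # delta-minimal set G_P(delta)
    return [g for g in G if excess(g) <= delta]

def diameter(delta):           # L2(P)-diameter of G_P(delta)
    gs = minimal_set(delta)
    return max(
        (math.sqrt(sum(ps * (a - b) ** 2 for ps, a, b in zip(p, g1, g2)))
         for g1, g2 in itertools.product(gs, gs)),
        default=0.0,
    )
```

As \(\delta \) grows, \(\mathcal{G}_P(\delta )\) collects more functions and its diameter increases; the excess risk bounds below control how fast this can happen.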
We will use these complexity measures to construct an upper confidence bound on the excess risk \(\mathcal{E}_P(\hat{f}_{n,2}).\) For a function \(\psi :{\mathbb R}_+\mapsto {\mathbb R}_+,\) define
\[ \psi ^{\flat }(\delta )\triangleq \sup _{\sigma \ge \delta }\frac{\psi (\sigma )}{\sigma }. \]
Let
\[ U_n^t(\delta )\triangleq \phi _n(\delta )+\sqrt{\frac{t}{n}}\,D_P(\delta )+\frac{t}{n}, \]
and define
\[ V_n^t(\delta )\triangleq \big(U_n^t\big)^{\flat }(\delta ),\qquad \sigma _n^t\triangleq \inf \big\{\delta >0:\ V_n^t(\delta )\le 1\big\}. \]
The following result is the first bound of Theorem 4.3 in [10].
Proposition 5
For all \(t>0,\)
\[ {\mathbb P}\big\{\mathcal{E}_P(\hat g_n)\ge \sigma _n^t\big\}\ \le\ C\,e^{-t}, \]
where \(C>0\) is a universal constant.
In addition, we will use the well-known bound on the expected sup-norm of the empirical process in terms of bracketing entropy; see Theorem 2.14.2 in [19]. More precisely, we will need the following simplified version of that result.
Lemma 2
Let \(\mathcal{T}\) be a class of functions from \(S\) into \([0,1]\) such that \(\Vert g\Vert _{L_2(P)}\le a\) for all \(g\in \mathcal{T}.\) Assume that \(H_{[\ ]}(a, \mathcal{T},\Vert \cdot \Vert _{L_2(P)})+1\le a^2 n\). Then,
\[ {\mathbb E}\sup _{g\in \mathcal{T}}\big|(P_n-P)g\big|\ \le\ \bar{C}\,\frac{a}{\sqrt{n}}\,\sqrt{H_{[\ ]}(a, \mathcal{T},\Vert \cdot \Vert _{L_2(P)})+1}, \]
where \(\bar{C}>0\) is a universal constant.
Proof of Proposition 4
Note that if \(t>n,\) then \((\frac{t}{n})^{\varkappa /(2\varkappa -1)}> 1,\) and the result holds trivially with \(C_*=1\) since \(R(\hat{f}_{n,2})-R(f^*_P)\le 1.\) Thus, it is enough to consider the case \(t\le n.\)
Let \(S=\mathcal{X}\times \{0,1\}\) and \(P\) be the distribution of \(Z=(X,Y)\). We will apply Proposition 5 to the class \(\mathcal{G}\triangleq \{g_f: \, g_f(x,y)=I_{\{y \ne f(x)\}}, \ f\in \mathcal{F}\}\). Then, clearly, \(Pg_f=R(f)\) and \(\mathcal{E}_P (g_f)= R(f)-R(f^*_P)\) for \(g_f(x,y)=I_{\{y \ne f(x)\}},\) which implies that
\[ \mathcal{E}_P\big(g_{\hat f_{n,2}}\big)=R(\hat f_{n,2})-R(f^*_P). \]
We also have \(\Vert g_{f_1}-g_{f_2}\Vert _{L_2(P)}^2=\Vert f_1-f_2\Vert _{L_1(\mu _X)}.\) Thus, it follows from Lemma 1 that, for all \(g_f\in \mathcal{G}\),
\[ \Vert g_f-g_{f^*_P}\Vert _{L_2(P)}^2=\Vert f-f^*_P\Vert _{L_1(\mu _X)}\ \le\ c\,\big(R(f)-R(f^*_P)\big)^{1/\varkappa }, \]
and we get a bound on the \(L_2(P)\)-diameter of the \(\delta \)-minimal set \(\mathcal{F}_P(\delta ):\) with some constant \({\bar{c}}_1>0\),
\[ D_P(\delta )\ \le\ {\bar{c}}_1\,\delta ^{\frac{1}{2\varkappa }}. \]
To bound the function \(\phi _n(\delta ),\) we will apply Lemma 2 to the class \(\mathcal{T}=\mathcal{F}_P(\delta )\) with \(a=1\). Note that
Using (17), we easily get from Lemma 2 that, with some constants \({\bar{c}}_2, {\bar{c}}_3>0\),
which implies that, with some constant \({\bar{c}}_4>0\),
This and (22) lead to the following bound on the function \(V_n^t(\delta )\):
that holds with some constant \({\bar{c}}_5.\) Thus, we end up with a bound on \(\sigma _n^t:\)
Note that, for \(\varkappa \ge 1,\;\rho < 1\), and \(t\le n,\) we have
Therefore, (23) can be simplified as follows:
and the result immediately follows from Proposition 5.\(\square \)
1.1 Tools for the Minimax Lower Bounds
For two probability measures \(\mu \) and \(\nu \) on a measurable space \(({\mathcal {X}}, {\mathcal A})\), we define the Kullback–Leibler divergence and the \(\chi ^2\)-divergence as follows:
\[ \mathcal {K}(\mu ,\nu )\triangleq \int \log g\, d\mu ,\qquad \chi ^2(\mu ,\nu )\triangleq \int (g-1)^2\, d\nu \]
if \(\mu \) is absolutely continuous with respect to \(\nu \) with Radon–Nikodym derivative \(g=\frac{d\mu }{d\nu },\) and we set \(\mathcal {K}(\mu ,\nu )\triangleq +\infty \), \(\chi ^2(\mu ,\nu )\triangleq +\infty \) otherwise.
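For discrete measures both divergences reduce to finite sums. The following sketch, with made-up masses, also checks the standard inequality \(\mathcal {K}\le \log (1+\chi ^2)\le \chi ^2\) between the two divergences, which is the kind of comparison used in the proof of Theorem 5 below:

```python
import math

# Discrete measures mu, nu on {0, 1, 2}; g = dmu/dnu is the ratio of masses.
mu = [0.2, 0.5, 0.3]
nu = [0.4, 0.4, 0.2]

def kl(mu, nu):
    # K(mu, nu) = sum_i mu_i log(mu_i / nu_i)   (= integral of log g dmu)
    return sum(m * math.log(m / v) for m, v in zip(mu, nu) if m > 0)

def chi2(mu, nu):
    # chi^2(mu, nu) = sum_i (mu_i - nu_i)^2 / nu_i   (= integral of (g-1)^2 dnu)
    return sum((m - v) ** 2 / v for m, v in zip(mu, nu))

K, C = kl(mu, nu), chi2(mu, nu)
assert K <= math.log(1 + C) <= C      # K <= log(1 + chi^2) <= chi^2
print(round(K, 4), round(C, 4))
```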
We will use the following auxiliary result.
Lemma 3
Let \(({\mathcal {X}}, {\mathcal A})\) be a measurable space, and let \(A_i \in {\mathcal A},\; i\in \{ 0,1,\dots ,M\}, M\ge 2\), be such that \(\forall i\ne j,\; A_i\cap A_j =\emptyset .\) Assume that \(Q_i\), \(i\in \{0,1\dots ,M\}\), are probability measures on \(({\mathcal {X}}, {\mathcal A})\) such that
Then,
Proof
Proposition 2.3 in [17] yields
In particular, taking \(\tau ^*=\min (M^{-1}, e^{-3\chi })\) and using that \(\sqrt{6\log M} \ge 2\) for \(M\ge 2\), we obtain
\(\square \)
We now prove an analog, for the classification setting, of the lower bound obtained by DeVore et al. [5] in the regression problem.
Theorem 5
Assume that a class \(\Theta \) of probability distributions \(P\), with the corresponding regression functions \(\eta _P\) and Bayes rules \(f^*_{P}\) (as defined above), contains a set \(\{{P_i}\}_{i=1}^N \subset \Theta ,\; N\ge 3\), with the following properties: the marginal distribution of \(X\) is \(\mu _X\) for all \(P_i\), independently of \(i\), where \(\mu _X\) is an arbitrary probability measure; \(1/4\le \eta _{P_i}\le 3/4,\; i=1,\dots ,N\); and, for any \(i\ne j\),
\[ \Vert \eta _{P_i}-\eta _{P_j}\Vert _{L_2(\mu _X)}\le \gamma \qquad \text{and}\qquad \mu _X\{f^*_{P_i}\ne f^*_{P_j}\}\ge s \]
with some \(\gamma >0,\; s>0\). Then, for any classifier \(\hat{f}_n\), we have
where \(\mathbb {P}_k\) denotes the product probability measure associated to the i.i.d. \(n\)-sample from \(P_k\).
Proof
We apply Lemma 3, where we set \(Q_i=\mathbb {P}_i\), \(M=N-1\), and define the random events \(A_i\) as follows:
\[ A_i\triangleq \big\{\mu _X\{\hat f_n\ne f^*_{P_i}\}< s/2\big\}. \]
The events \(A_i\) are disjoint because of (25). Thus, the theorem follows from Lemma 3 if we prove that \( \mathcal {K}(\mathbb {P}_i,\mathbb {P}_j)\le 8n\gamma ^2\) for all \(i,j\).
Let us evaluate \( \mathcal {K}(\mathbb {P}_i,\mathbb {P}_j)\). For each \(\eta _{P_i}\), the corresponding measure \(P_i\) is determined as follows:
\[ dP_i(x,y)=\big(\eta _{P_i}(x)\, d\delta _1(y)+(1-\eta _{P_i}(x))\, d\delta _0(y)\big)\, d\mu _X(x), \]
where \(d\delta _\xi \) denotes the Dirac measure with unit mass at \(\xi \). Set for brevity \(\eta _i\triangleq \eta _{P_i}\). Fix \(i\) and \(j\). We have \(dP_i(x,y)= g(x,y)dP_j(x,y)\), where
\[ g(x,y)=\frac{\eta _i(x)}{\eta _j(x)}\,I_{\{y=1\}}+\frac{1-\eta _i(x)}{1-\eta _j(x)}\,I_{\{y=0\}}. \]
Therefore, using the inequalities \(1/4\le \eta _{i}, \eta _j\le 3/4\) and (24), we find
\[ \chi ^2(P_i,P_j)=\int _{\mathcal{X}}\Big(\frac{(\eta _i-\eta _j)^2}{\eta _j}+\frac{(\eta _i-\eta _j)^2}{1-\eta _j}\Big)\, d\mu _X\ \le\ 8\,\Vert \eta _i-\eta _j\Vert _{L_2(\mu _X)}^2\ \le\ 8\gamma ^2. \]
Together with the inequality between the Kullback–Leibler and \(\chi ^2\)-divergences, cf. [17], p. 90, this yields
\[ \mathcal {K}(\mathbb {P}_i,\mathbb {P}_j)=n\,\mathcal {K}(P_i,P_j)\ \le\ n\,\chi ^2(P_i,P_j)\ \le\ 8n\gamma ^2. \]
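This computation is easy to check numerically on a discrete design; in the sketch below the marginal \(\mu _X\) and the two regression functions are made up, subject only to the constraint \(1/4\le \eta \le 3/4\) from the theorem:

```python
import math

# Discrete design: X takes 4 values with law mu_X; two regression functions
# with values in [1/4, 3/4].  P_i puts mass mu_X(x) * eta_i(x) at (x, 1)
# and mu_X(x) * (1 - eta_i(x)) at (x, 0).
mu_x  = [0.25, 0.25, 0.25, 0.25]
eta_i = [0.30, 0.60, 0.45, 0.70]
eta_j = [0.35, 0.50, 0.55, 0.65]

def chi2_pair(ei, ej):
    # chi^2(P_i, P_j) = int (e_i - e_j)^2 * (1/e_j + 1/(1 - e_j)) dmu_X
    return sum(m * (a - b) ** 2 * (1 / b + 1 / (1 - b))
               for m, a, b in zip(mu_x, ei, ej))

def kl_pair(ei, ej):
    # K(P_i, P_j), summing the per-point Bernoulli KL against mu_X
    return sum(m * (a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b)))
               for m, a, b in zip(mu_x, ei, ej))

gamma2 = sum(m * (a - b) ** 2 for m, a, b in zip(mu_x, eta_i, eta_j))

# 1/4 <= eta <= 3/4 gives 1/eta + 1/(1 - eta) <= 8, hence chi^2 <= 8 gamma^2;
# tensorization then gives K(P_i^n, P_j^n) = n K(P_i, P_j) <= 8 n gamma^2.
assert kl_pair(eta_i, eta_j) <= chi2_pair(eta_i, eta_j) <= 8 * gamma2
```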
\(\square \)
Comment. The preprint version of this paper was posted on arXiv under the pseudonym N.I. Pentacaput [13]. The paper was then submitted to “Constructive Approximation” and accepted for publication under this pseudonym. However, it turned out that, under the publisher's rules, no paper can be published under a pseudonym. As a result, we publish the paper under our real names, which we have chosen to arrange in a random order.
Kerkyacharian, G., Tsybakov, A.B., Temlyakov, V. et al. Optimal Exponential Bounds on the Accuracy of Classification. Constr Approx 39, 421–444 (2014). https://doi.org/10.1007/s00365-014-9229-3
Keywords
- Statistical learning
- Classification
- Fast rates
- Optimal rate of convergence
- Excess risk
- Margin condition
- Bahadur efficiency