Abstract
We introduce an alternative to the notion of ‘fast rate’ in Learning Theory, which coincides with the optimal error rate when the given class happens to be convex and regular in some sense. While it is well known that such a rate cannot always be attained by a learning procedure (i.e., a procedure that selects a function in the given class), we introduce an aggregation procedure that attains that rate under rather minimal assumptions—for example, that the \(L_q\) and \(L_2\) norms are equivalent on the linear span of the class for some \(q>2\), and the target random variable is square-integrable. The key components in the proof include a two-sided isomorphic estimator on distances between class members, which is based on the median-of-means; and an almost isometric lower bound of the form \(N^{-1}\sum _{i=1}^N f^2(X_i) \ge (1-\zeta )\mathbb {E}f^2\) which holds uniformly in the class. Both results only require that the \(L_q\) and \(L_2\) norms are equivalent on the linear span of the class for some \(q>2\).
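The median-of-means device mentioned above can be sketched as follows. This is a generic illustration of the basic estimator, not the paper's actual two-sided distance estimator; the function name and choice of blocks are ours:

```python
import statistics

def median_of_means(sample, num_blocks):
    """Median-of-means estimator of a mean: partition the sample into
    blocks of equal size, average within each block, and return the
    median of the block means. Unlike the empirical mean, the result
    is robust to a small number of heavy-tailed outliers."""
    block_size = len(sample) // num_blocks
    block_means = [
        sum(sample[i * block_size:(i + 1) * block_size]) / block_size
        for i in range(num_blocks)
    ]
    return statistics.median(block_means)

# A single gross outlier corrupts the empirical mean but not the
# median-of-means: here 5 blocks of size 2 give block means
# [1, 1, 1, 1, 500.5], whose median is 1.
sample = [1.0] * 9 + [1000.0]
print(median_of_means(sample, 5))   # robust estimate: 1.0
print(sum(sample) / len(sample))    # empirical mean: 100.9
```

Because the median discards the blocks contaminated by heavy tails, only a confidence-level-dependent number of blocks is needed, which is why such estimators work under weak moment assumptions of the kind imposed in the paper.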
Notes
This is sometimes called a proper learning procedure.
Naturally, no nontrivial error rate can hold without some assumption on the class \({\mathcal {F}}\). The aim is to obtain such an estimate under minimal assumptions on the class—as specified in what follows.
Roughly put, the minimax rate is the best possible error rate one may achieve by any learning procedure, i.e., by any \(\Psi :(\Omega \times \mathbb {R})^N \rightarrow {\mathcal {F}}\).
The reason for calling \(f^*(X)-Y\) the ‘noise’ is the independent-noise case, in which \(Y=f^*(X)-\xi \) with \(\xi \) mean-zero and independent of X. Then \(f^*(X)-Y=\xi \) is indeed the noise, and its \(L_q\) norm calibrates the noise level of the problem.
Additional information
Supported in part by the Mathematical Sciences Institute, The Australian National University, Canberra, ACT 2601, Australia, and by an Israel Science Foundation grant 707/14.
Cite this article
Mendelson, S. On aggregation for heavy-tailed classes. Probab. Theory Relat. Fields 168, 641–674 (2017). https://doi.org/10.1007/s00440-016-0720-6