Abstract
We study the performance of empirical risk minimization in prediction and estimation problems that are carried out in a convex class and with respect to a sufficiently smooth convex loss function. The framework is based on the small-ball method and is therefore suited to heavy-tailed problems. Moreover, among its outcomes is that a well-chosen loss, calibrated to fit the noise level of the problem, negates some of the ill effects of outliers and boosts the confidence level, leading to Gaussian-like behaviour even when the target random variable is heavy-tailed.
Notes
We refer to \(f^*(X)-Y\) as the noise of the problem. The name is most natural when \(Y=f_0(X)-W\) for a mean-zero random variable W that is independent of X, but we use the term 'noise' even when the target does not have that particular form.
The log-loss is more commonly used in the context of binary classification problems than in the type of real-valued problems we study here. However, because of its convexity properties, it is an interesting example of the phenomenon we explore.
Let us mention that it is possible to modify the arguments and tackle situations in which the constants \(\kappa \) and \(\varepsilon \) are not uniform, but to keep this article at a reasonable length we defer this to future work.
One may show that, under rather reasonable conditions, if \(f^*_\gamma \) is the true minimizer of the Huber loss with parameter \(\gamma \) and \(f^*\) is the true minimizer of the squared loss, then \(\Vert f_\gamma ^*(X)-Y\Vert _{L_2} \lesssim \Vert f^*(X)-Y\Vert _{L_2}\).
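For reference, the Huber loss with parameter \(\gamma\) mentioned above is the standard one: quadratic near the origin and linear in the tails, so that \(f^*_\gamma\) is the minimizer of \(\mathbb {E}\,\ell _\gamma (f(X)-Y)\) over the class. Explicitly,

```latex
\ell_\gamma(t) =
\begin{cases}
\tfrac{1}{2}\,t^2, & |t| \le \gamma,\\[2pt]
\gamma\,|t| - \tfrac{1}{2}\,\gamma^2, & |t| > \gamma,
\end{cases}
```

which is convex and continuously differentiable, with the parameter \(\gamma\) controlling the transition from the squared loss to an absolute-value-type loss.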
Additional information
Partially supported by ISF Grant 707/14.
Cite this article
Mendelson, S. Learning without concentration for general loss functions. Probab. Theory Relat. Fields 171, 459–502 (2018). https://doi.org/10.1007/s00440-017-0784-y