Abstract
We study the performance of empirical risk minimization in prediction and estimation problems that are carried out in a convex class and with respect to a sufficiently smooth convex loss function. The framework is based on the small-ball method and is therefore suited to heavy-tailed problems. Moreover, among its outcomes is that a well-chosen loss, calibrated to fit the noise level of the problem, negates some of the ill effects of outliers and boosts the confidence level, leading to Gaussian-like behaviour even when the target random variable is heavy-tailed.
Notes
We refer to \(f^*(X)-Y\) as the noise of the problem. The name is most natural when \(Y=f_0(X)-W\) for a mean-zero random variable W that is independent of X, but we use the term 'noise' even when the target does not have that particular form.
The log-loss is more commonly used in the context of binary classification problems than in the type of real-valued problems we study here. However, because of its convexity properties, it is an interesting example of the phenomenon we explore.
Let us mention that it is possible to modify the arguments and tackle situations in which the constants \(\kappa \) and \(\varepsilon \) are not uniform, but to keep this article at a reasonable length we defer this to future work.
One may show that, under rather reasonable conditions, if \(f^*_\gamma \) is the true minimizer of the Huber loss with parameter \(\gamma \) and \(f^*\) is the true minimizer of the squared loss, then \(\Vert f_\gamma ^*(X)-Y\Vert _{L_2} \lesssim \Vert f^*(X)-Y\Vert _{L_2}\).
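For reference, the Huber loss with parameter \(\gamma\) mentioned above is the standard one: quadratic near the origin and linear in the tails, so that \(f^*_\gamma\) is the minimizer of \(\mathbb {E}\,\ell _\gamma (f(X)-Y)\) over the class. Explicitly,

```latex
\ell_\gamma(t) =
\begin{cases}
\tfrac{1}{2}\,t^2, & |t| \le \gamma,\\[2pt]
\gamma\,|t| - \tfrac{1}{2}\,\gamma^2, & |t| > \gamma,
\end{cases}
```

which is convex and continuously differentiable, with the parameter \(\gamma\) controlling the transition from the squared loss to an absolute-value-type loss.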
Additional information
Partially supported by ISF Grant 707/14.
Cite this article
Mendelson, S. Learning without concentration for general loss functions. Probab. Theory Relat. Fields 171, 459–502 (2018). https://doi.org/10.1007/s00440-017-0784-y