Foundations of Computational Mathematics, Volume 19, Issue 5, pp 1145–1190

Mean Estimation and Regression Under Heavy-Tailed Distributions: A Survey

  • Gábor Lugosi
  • Shahar Mendelson


We survey some of the recent advances in mean estimation and regression function estimation. In particular, we describe sub-Gaussian mean estimators for possibly heavy-tailed data in both the univariate and multivariate settings. We focus on estimators based on median-of-means techniques, but other methods such as the trimmed-mean and Catoni’s estimators are also reviewed. We give detailed proofs for the cornerstone results. We dedicate a section to statistical learning problems—in particular, regression function estimation—in the presence of possibly heavy-tailed data.
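To make the median-of-means idea named above concrete, here is a minimal univariate sketch in Python. It is an illustration under stated assumptions, not the survey's own pseudocode: the function name `median_of_means` and the parameter `num_blocks` are hypothetical, blocks are taken in the order the data arrive (which presumes an i.i.d. sample), and leftover points that do not fill a block are simply dropped. The number of blocks is usually taken proportional to log(1/δ) for a target confidence level 1 − δ; the exact constant is left to the caller here.

```python
import statistics


def median_of_means(samples, num_blocks):
    """Median-of-means estimate of the mean of `samples` (illustrative sketch).

    The data are split into `num_blocks` disjoint blocks of equal size,
    each block is averaged, and the median of the block averages is returned.
    Points left over after forming equal-size blocks are ignored.
    """
    n = len(samples)
    if not 1 <= num_blocks <= n:
        raise ValueError("num_blocks must be between 1 and len(samples)")
    block_size = n // num_blocks
    block_means = [
        statistics.fmean(samples[i * block_size:(i + 1) * block_size])
        for i in range(num_blocks)
    ]
    return statistics.median(block_means)


# A single extreme observation shifts the empirical mean substantially,
# but it only affects one block average and so leaves the median unchanged.
data = [1.0, 1.0, 1.0, 1.0, 1.0, 1000.0]
empirical_mean = statistics.fmean(data)
robust_estimate = median_of_means(data, num_blocks=3)
```

With `num_blocks=1` the estimator reduces to the empirical mean, and with `num_blocks=len(samples)` it reduces to the sample median, so the block count trades off between the two.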


Keywords

Mean estimation, Heavy-tailed distributions, Robustness, Regression function estimation, Statistical learning

Mathematics Subject Classification

62G05 62G15 62G35 



We thank Sam Hopkins, Stanislav Minsker, and Roberto Imbuzeiro Oliveira for illuminating discussions on the subject. We also thank two referees for their thorough reports and insightful comments.



Copyright information

© SFoCM 2019

Authors and Affiliations

  1. Department of Economics and Business, Pompeu Fabra University, Barcelona, Spain
  2. ICREA, Barcelona, Spain
  3. Barcelona Graduate School of Economics, Barcelona, Spain
  4. Mathematical Sciences Institute, The Australian National University, Canberra, Australia
  5. LPSM, Sorbonne University, Paris, France
