Machine Learning, Volume 106, Issue 9–10, pp 1643–1679

Robust regression using biased objectives

Part of the following topical collections:
  1. Special Issue of the ECML PKDD 2017 Journal Track


In non-parametric regression, the design of the objective function to be minimized by the learner is critical. In this paper we propose a principled method for constructing and minimizing robust losses, which are resilient to errant observations even under small samples. Existing proposals typically utilize very strong estimates of the true risk, but in doing so require a priori information that is not available in practice. By abandoning direct approximation of the risk, we obtain substantial gains in stability at a tolerable price in terms of bias, while circumventing the computational issues of existing procedures. We analyze existence and convergence conditions, provide practical computational routines, and show empirically that the proposed method realizes superior robustness over wide classes of data without assuming prior knowledge.


Keywords: Robust loss, Heavy-tailed noise, Risk minimization



The authors would like to thank the anonymous reviewers for their constructive comments, which resulted in substantial improvements to the manuscript.




Copyright information

© The Author(s) 2017

Authors and Affiliations

  1. Graduate School of Information Science, Nara Institute of Science and Technology, Ikoma, Nara, Japan
