Machine Learning, Volume 108, Issue 8–9, pp 1523–1560

Efficient learning with robust gradient descent

  • Matthew J. Holland
  • Kazushi Ikeda
Part of the following topical collections:
  1. Special Issue of the ECML PKDD 2019 Journal Track


Minimizing the empirical risk is a popular training strategy, but for learning tasks where the data may be noisy or heavy-tailed, one may require many observations in order to generalize well. To achieve better performance under less stringent requirements, we introduce a procedure which constructs a robust approximation of the risk gradient for use in an iterative learning routine. Using high-probability bounds on the excess risk of this algorithm, we show that our update does not deviate far from the ideal gradient-based update. Empirical tests using both controlled simulations and real-world benchmark data show that in diverse settings, the proposed procedure can learn more efficiently, using fewer resources (iterations and observations) while generalizing better.
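The core idea of the abstract, replacing the empirical mean of per-example gradients with a robust estimate before each descent step, can be illustrated with a median-of-means sketch. This is only a minimal, hypothetical variant for squared-loss linear regression (the paper's own gradient estimator differs in its construction); all function names and parameters below are illustrative.

```python
import numpy as np

def mom_gradient(X, y, w, k=5, rng=None):
    """Median-of-means estimate of the risk gradient for squared loss.

    Per-example gradients are split into k random blocks; the block
    means are combined via the coordinate-wise median, which resists
    heavy-tailed or outlying observations far better than the plain
    empirical mean.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    residuals = X @ w - y                  # shape (n,)
    grads = residuals[:, None] * X         # per-example gradients, shape (n, d)
    blocks = np.array_split(rng.permutation(n), k)
    block_means = np.stack([grads[b].mean(axis=0) for b in blocks])
    return np.median(block_means, axis=0)

def robust_gd(X, y, steps=200, lr=0.1, k=5, seed=0):
    """Plain gradient descent driven by the robust gradient estimate."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * mom_gradient(X, y, w, k=k, rng=rng)
    return w
```

Under heavy-tailed noise (e.g. Student-t residuals with infinite variance), the median over blocks keeps each update close to the population gradient even when a few blocks are corrupted by extreme observations, which is the kind of stability the abstract's high-probability bounds formalize.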


Keywords: Robust learning · Stochastic optimization · Statistical learning theory



This work was partially supported by the Grant-in-Aid for JSPS Research Fellows.




Copyright information

© The Author(s), under exclusive licence to Springer Science+Business Media LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Osaka University, Ibaraki, Japan
  2. Nara Institute of Science and Technology, Ikoma, Japan
