Efficient learning with robust gradient descent
Minimizing the empirical risk is a popular training strategy, but for learning tasks where the data may be noisy or heavy-tailed, many observations may be required in order to generalize well. To achieve better performance under less stringent requirements, we introduce a procedure that constructs a robust approximation of the risk gradient for use in an iterative learning routine. Using high-probability bounds on the excess risk of this algorithm, we show that our update does not deviate far from the ideal gradient-based update. Empirical tests on both controlled simulations and real-world benchmark data show that, in diverse settings, the proposed procedure can learn more efficiently, using fewer resources (iterations and observations) while generalizing better.
Keywords: Robust learning · Stochastic optimization · Statistical learning theory
This work was partially supported by the Grant-in-Aid for JSPS Research Fellows.
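To make the procedure described in the abstract concrete, the following is a minimal sketch of the general idea: before each descent step, the empirical mean of the per-example gradients is replaced by a robust estimate. The coordinate-wise median-of-means estimator used here is one standard robust mean strategy; the paper's own estimator may differ, and all function names (`robust_gradient`, `gd_step`) are hypothetical.

```python
import numpy as np

def robust_gradient(grads, k=5, rng=None):
    """Coordinate-wise median-of-means over per-example gradients.

    grads : (n, d) array, one gradient row per observation.
    k     : number of disjoint blocks; taking the median of the block
            means tolerates heavy-tailed rows that would skew a plain mean.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = grads.shape[0]
    # Shuffle rows, split into k disjoint blocks, average within each block.
    blocks = np.array_split(grads[rng.permutation(n)], k)
    block_means = np.stack([b.mean(axis=0) for b in blocks])
    # Combine block means robustly, coordinate by coordinate.
    return np.median(block_means, axis=0)

def gd_step(w, X, y, lr=0.1, k=5, rng=None):
    """One robust gradient-descent step for squared-error linear regression."""
    residuals = X @ w - y
    grads = residuals[:, None] * X  # per-example gradients, shape (n, d)
    return w - lr * robust_gradient(grads, k=k, rng=rng)
```

The design choice this illustrates: when the losses (and hence gradients) are heavy-tailed, the sample mean of the gradients can be badly skewed by a few extreme observations, whereas the median of block means remains stable with high probability. The paper's analysis concerns precisely how far such a robust update can deviate from the ideal gradient-based update.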