Width of Minima Reached by Stochastic Gradient Descent is Influenced by Learning Rate to Batch Size Ratio

  • Stanislaw Jastrzębski
  • Zachary Kenton
  • Devansh Arpit
  • Nicolas Ballas
  • Asja Fischer
  • Yoshua Bengio
  • Amos Storkey
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11141)

Abstract

We show that the dynamics and convergence properties of SGD are set by the ratio of learning rate to batch size. We observe that this ratio is a key determinant of the generalization error, which we suggest is mediated by controlling the width of the final minima found by SGD. We verify our analysis experimentally on a range of deep neural networks and datasets.
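One common way to make the ratio's role concrete is the scale of mini-batch gradient noise: with learning rate η and batch size B, each SGD step applies η times an average of B per-example gradients, so the noise injected per step grows with η/B. The following is a minimal, self-contained sketch, not taken from the paper: it uses a hypothetical toy setup (SGD on a one-dimensional quadratic loss with Gaussian per-example gradient noise, and an illustrative helper `sgd_stationary_std`) rather than the deep networks studied in the paper. Runs that share the same η/B ratio settle into a similarly wide stationary distribution around the minimum, while a larger ratio produces a wider one.

```python
# Toy sketch (assumption, not the paper's experiment): SGD on a 1-D quadratic
# loss L(theta) = 0.5 * curvature * theta**2, where each per-example gradient
# is the true gradient plus Gaussian noise. We measure the stationary spread
# of theta and compare configurations with equal or different lr / batch_size.
import numpy as np

def sgd_stationary_std(lr, batch_size, curvature=1.0, noise_std=1.0,
                       n_steps=100_000, burn_in=20_000, seed=0):
    """Run SGD and return the empirical standard deviation of theta after burn-in."""
    rng = np.random.default_rng(seed)
    theta = 0.0
    samples = []
    for t in range(n_steps):
        # Mini-batch gradient: true gradient plus the average of per-example noise.
        noise = rng.normal(0.0, noise_std, size=batch_size).mean()
        grad = curvature * theta + noise
        theta -= lr * grad
        if t >= burn_in:
            samples.append(theta)
    return float(np.std(samples))

if __name__ == "__main__":
    # Same lr/batch_size ratio -> similar stationary spread around the minimum.
    print(sgd_stationary_std(lr=0.01, batch_size=10))  # ratio 1e-3
    print(sgd_stationary_std(lr=0.02, batch_size=20))  # ratio 1e-3, comparable spread
    # Larger ratio -> noticeably wider spread (more exploration of the loss surface).
    print(sgd_stationary_std(lr=0.04, batch_size=10))  # ratio 4e-3
```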

Acknowledgements

We thank NSERC, Canada Research Chairs, IVADO and CIFAR for funding. SJ was in part supported by Grant No. DI 2014/016644 and ETIUDA stipend No. 2017/24/T/ST6/00487. This project has received funding from the European Union’s Horizon 2020 programme under grant agreement No. 732204 and the Swiss State Secretariat for Education, Research and Innovation under contract No. 16.0159.

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Stanislaw Jastrzębski (1, 2, 3)
  • Zachary Kenton (1, 2)
  • Devansh Arpit (2)
  • Nicolas Ballas (3)
  • Asja Fischer (4)
  • Yoshua Bengio (2, 5)
  • Amos Storkey (6)

  1. Jagiellonian University, Kraków, Poland
  2. MILA, Université de Montréal, Montreal, Canada
  3. Facebook AI Research, Paris, France
  4. Faculty of Mathematics, Ruhr-University Bochum, Bochum, Germany
  5. CIFAR Senior Fellow, Toronto, Canada
  6. School of Informatics, University of Edinburgh, Edinburgh, Scotland