Width of Minima Reached by Stochastic Gradient Descent is Influenced by Learning Rate to Batch Size Ratio

  • Conference paper
Artificial Neural Networks and Machine Learning – ICANN 2018 (ICANN 2018)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11141)

Abstract

We show that the dynamics and convergence properties of SGD are set by the ratio of learning rate to batch size. We observe that this ratio is a key determinant of the generalization error, which we suggest is mediated by controlling the width of the final minima found by SGD. We verify our analysis experimentally on a range of deep neural networks and datasets.
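A brief sketch of the standard argument behind this claim (ours, stated under central-limit and small-step assumptions; the symbols \(\theta \), \(L\), \(\Sigma \), \(B\) are introduced only for this sketch): the SGD update with learning rate \(\eta \) on a minibatch of size \(B\) can be read as the Euler discretization of a stochastic differential equation whose noise term is scaled by \(\eta /B\),

\[ \theta _{k+1} = \theta _k - \eta \, g_B(\theta _k), \qquad g_B(\theta ) \approx \nabla L(\theta ) + \tfrac{1}{\sqrt{B}}\, \epsilon , \quad \epsilon \sim \mathcal {N}\bigl (0, \Sigma (\theta )\bigr ), \]

\[ \mathrm {d}\theta = -\nabla L(\theta )\, \mathrm {d}t + \sqrt{\tfrac{\eta }{B}}\, \Sigma (\theta )^{1/2}\, \mathrm {d}W(t), \qquad t = k\eta . \]

Under these assumptions only the ratio \(\eta /B\) sets the magnitude of the stochastic term, which is the quantity the paper links to the width of the final minima.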

S. Jastrzębski and Z. Kenton contributed equally.

Notes

  1. See [12] for a different SDE which also has a discretization equivalent to SGD.

  2. See e.g. [9].

  3. For a more formal analysis, not requiring central limit arguments, see the alternative approach of [11], which also considers SGD as a discretization of an SDE. Note that the batch size is not present there.

  4. Including the paths of the dynamics, the equilibria, and the shape of the learning curves.

  5. We have adapted the final layers to be compatible with the CIFAR10 dataset.

  6. Each experiment was repeated for 5 different random initializations.

  7. We used an adaptive learning rate schedule with \(\eta \) dropping by a factor of 10 at epochs 60, 100, 140, 180 for ResNet56 and by a factor of 2 every 25 epochs for VGG11 (a code sketch of such a schedule follows these notes).

  8. Each experiment was run for 200 epochs, during which most models reached an accuracy of almost \(100\%\) on the training set.

  9. Assuming the network has enough capacity.

  10. Experiments are repeated 5 times with different random seeds. The graphs show the mean validation accuracies, and the numbers in brackets give the mean and standard deviation of the maximum validation accuracy across the runs. The * denotes that at least one seed diverged.

  11. This holds approximately, in the limit of small batch size compared to the training set size.
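To make the schedule in note 7 concrete, the following is a minimal sketch of such step schedules in PyTorch; the paper does not specify an implementation, so the model, the base learning rate, and the dummy training loop are placeholders, and only the drop pattern follows the note.

import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import MultiStepLR, StepLR

# Placeholder model and base learning rate; in the experiments these would be
# ResNet56 or VGG11 and the chosen value of eta.
model = torch.nn.Linear(10, 10)
opt = SGD(model.parameters(), lr=0.1)

# ResNet56 schedule: drop eta by a factor of 10 at epochs 60, 100, 140, 180.
scheduler = MultiStepLR(opt, milestones=[60, 100, 140, 180], gamma=0.1)
# VGG11 schedule (alternative): drop eta by a factor of 2 every 25 epochs.
# scheduler = StepLR(opt, step_size=25, gamma=0.5)

for epoch in range(200):  # note 8: each experiment runs for 200 epochs
    # Stand-in for one training epoch (a single dummy batch here).
    opt.zero_grad()
    loss = model(torch.randn(8, 10)).sum()
    loss.backward()
    opt.step()
    scheduler.step()  # apply the epoch-level learning-rate drop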

References

  1. Advani, M.S., Saxe, A.M.: High-dimensional dynamics of generalization error in neural networks. arXiv preprint arXiv:1710.03667 (2017)

  2. Arpit, D., et al.: A closer look at memorization in deep networks. In: ICML (2017)

  3. Bottou, L.: Online learning and stochastic approximations. On-line Learn. Neural Netw. 17(9), 142 (1998)

  4. Chaudhari, P., Soatto, S.: Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. arXiv preprint arXiv:1710.11029 (2017)

  5. Goyal, P., et al.: Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv e-prints (2017)

  6. Hochreiter, S., Schmidhuber, J.: Flat minima. Neural Comput. 9(1), 1–42 (1997)

  7. Hoffer, E., et al.: Train longer, generalize better: closing the generalization gap in large batch training of neural networks. arXiv preprint arXiv:1705.08741 (2017)

  8. Junchi Li, C., et al.: Batch size matters: a diffusion approximation framework on nonconvex stochastic gradient descent. arXiv e-prints (2017)

  9. Kloeden, P.E., Platen, E.: Numerical Solution of Stochastic Differential Equations. Springer, Heidelberg (1992). https://doi.org/10.1007/978-3-662-12616-5

  10. Kushner, H., Yin, G.: Stochastic Approximation and Recursive Algorithms and Applications. Stochastic Modelling and Applied Probability, vol. 35, 2nd edn. Springer (2003)

  11. Li, Q., Tai, C., E, W.: Stochastic modified equations and adaptive stochastic gradient algorithms. In: Proceedings of the 34th ICML (2017)

  12. Mandt, S., Hoffman, M.D., Blei, D.M.: Stochastic gradient descent as approximate Bayesian inference. J. Mach. Learn. Res. 18, 134:1–134:35 (2017)

  13. Poggio, T., et al.: Theory of deep learning III: explaining the non-overfitting puzzle. arXiv preprint arXiv:1801.00173 (2018)

  14. Sagun, L., Evci, U., Ugur Guney, V., Dauphin, Y., Bottou, L.: Empirical analysis of the Hessian of over-parametrized neural networks. arXiv e-prints (2017)

  15. Shirish Keskar, N., Mudigere, D., Nocedal, J., Smelyanskiy, M., Tang, P.T.P.: On large-batch training for deep learning: generalization gap and sharp minima. arXiv e-prints (2016)

  16. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)

  17. Smith, S., Le, Q.: Understanding generalization and stochastic gradient descent. arXiv preprint arXiv:1710.06451 (2017)

  18. Welling, M., Teh, Y.W.: Bayesian learning via stochastic gradient Langevin dynamics. In: Proceedings of the 28th ICML, pp. 681–688 (2011)

  19. Xiao, H., Rasul, K., Vollgraf, R.: Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv e-prints (2017)

  20. Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O.: Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530 (2016)

  21. Zhu, Z., Wu, J., Yu, B., Wu, L., Ma, J.: The regularization effects of anisotropic noise in stochastic gradient descent. arXiv e-prints (2018)


Acknowledgements

We thank NSERC, Canada Research Chairs, IVADO and CIFAR for funding. SJ was in part supported by Grant No. DI 2014/016644 and ETIUDA stipend No. 2017/24/T/ST6/00487. This project has received funding from the European Union’s Horizon 2020 programme under grant agreement No 732204 and the Swiss State Secretariat for Education, Research and Innovation under contract No. 16.0159.

Author information

Corresponding author

Correspondence to Stanislaw Jastrzębski.


Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Jastrzębski, S. et al. (2018). Width of Minima Reached by Stochastic Gradient Descent is Influenced by Learning Rate to Batch Size Ratio. In: Kůrková, V., Manolopoulos, Y., Hammer, B., Iliadis, L., Maglogiannis, I. (eds) Artificial Neural Networks and Machine Learning – ICANN 2018. ICANN 2018. Lecture Notes in Computer Science, vol. 11141. Springer, Cham. https://doi.org/10.1007/978-3-030-01424-7_39


  • DOI: https://doi.org/10.1007/978-3-030-01424-7_39

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-01423-0

  • Online ISBN: 978-3-030-01424-7

  • eBook Packages: Computer Science, Computer Science (R0)
