Width of Minima Reached by Stochastic Gradient Descent is Influenced by Learning Rate to Batch Size Ratio

  • Conference paper
Artificial Neural Networks and Machine Learning – ICANN 2018 (ICANN 2018)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11141)

Abstract

We show that the dynamics and convergence properties of SGD are set by the ratio of learning rate to batch size. We observe that this ratio is a key determinant of the generalization error, which we suggest is mediated by controlling the width of the final minima found by SGD. We verify our analysis experimentally on a range of deep neural networks and datasets.
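A brief sketch of the standard argument behind this claim (ours, stated under central-limit and small-step assumptions; the symbols \(\theta \), \(L\), \(\Sigma \), \(B\) are introduced only for this sketch): the SGD update with learning rate \(\eta \) on a minibatch of size \(B\) can be read as the Euler discretization of a stochastic differential equation whose noise term is scaled by \(\eta /B\),

\[ \theta _{k+1} = \theta _k - \eta \, g_B(\theta _k), \qquad g_B(\theta ) \approx \nabla L(\theta ) + \tfrac{1}{\sqrt{B}}\, \epsilon , \quad \epsilon \sim \mathcal {N}\bigl (0, \Sigma (\theta )\bigr ), \]

\[ \mathrm {d}\theta = -\nabla L(\theta )\, \mathrm {d}t + \sqrt{\tfrac{\eta }{B}}\, \Sigma (\theta )^{1/2}\, \mathrm {d}W(t), \qquad t = k\eta . \]

Under these assumptions only the ratio \(\eta /B\) sets the magnitude of the stochastic term, which is the quantity the paper links to the width of the final minima.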

S. Jastrzębski and Z. Kenton contributed equally.

Notes

  1. See [12] for a different SDE which also has a discretization equivalent to SGD.

  2. See e.g. [9].

  3. For a more formal analysis, not requiring central limit arguments, see the alternative approach of [11], which also considers SGD as a discretization of an SDE. Note that the batch size is not present there.

  4. Including the paths of the dynamics, the equilibria, and the shape of the learning curves.

  5. We have adapted the final layers to be compatible with the CIFAR10 dataset.

  6. Each experiment was repeated for 5 different random initializations.

  7. We used an adaptive learning rate schedule with \(\eta \) dropping by a factor of 10 at epochs 60, 100, 140, 180 for ResNet56 and by a factor of 2 every 25 epochs for VGG11 (a code sketch of such a schedule follows these notes).

  8. Each experiment was run for 200 epochs, during which most models reached an accuracy of almost \(100\%\) on the training set.

  9. Assuming the network has enough capacity.

  10. Experiments are repeated 5 times with different random seeds. The graphs show the mean validation accuracies, and the numbers in brackets give the mean and standard deviation of the maximum validation accuracy across the runs. The * denotes that at least one seed diverged.

  11. This holds approximately, in the limit of small batch size compared to the training set size.
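To make the schedule in note 7 concrete, the following is a minimal sketch of such step schedules in PyTorch; the paper does not specify an implementation, so the model, the base learning rate, and the dummy training loop are placeholders, and only the drop pattern follows the note.

import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import MultiStepLR, StepLR

# Placeholder model and base learning rate; in the experiments these would be
# ResNet56 or VGG11 and the chosen value of eta.
model = torch.nn.Linear(10, 10)
opt = SGD(model.parameters(), lr=0.1)

# ResNet56 schedule: drop eta by a factor of 10 at epochs 60, 100, 140, 180.
scheduler = MultiStepLR(opt, milestones=[60, 100, 140, 180], gamma=0.1)
# VGG11 schedule (alternative): drop eta by a factor of 2 every 25 epochs.
# scheduler = StepLR(opt, step_size=25, gamma=0.5)

for epoch in range(200):  # note 8: each experiment runs for 200 epochs
    # Stand-in for one training epoch (a single dummy batch here).
    opt.zero_grad()
    loss = model(torch.randn(8, 10)).sum()
    loss.backward()
    opt.step()
    scheduler.step()  # apply the epoch-level learning-rate drop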

References

  1. Advani, M.S., Saxe, A.M.: High-dimensional dynamics of generalization error in neural networks. arXiv preprint arXiv:1710.03667 (2017)

  2. Arpit, D., et al.: A closer look at memorization in deep networks. In: ICML (2017)

  3. Bottou, L.: Online learning and stochastic approximations. On-line Learn. Neural Netw. 17(9), 142 (1998)

  4. Chaudhari, P., Soatto, S.: Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. arXiv preprint arXiv:1710.11029 (2017)

  5. Goyal, P., et al.: Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv e-prints (2017)

  6. Hochreiter, S., Schmidhuber, J.: Flat minima. Neural Comput. 9(1), 1–42 (1997)

  7. Hoffer, E., et al.: Train longer, generalize better: closing the generalization gap in large batch training of neural networks. arXiv preprint arXiv:1705.08741 (2017)

  8. Junchi Li, C., et al.: Batch size matters: a diffusion approximation framework on nonconvex stochastic gradient descent. arXiv e-prints (2017)

  9. Kloeden, P.E., Platen, E.: Numerical Solution of Stochastic Differential Equations. Springer, Heidelberg (1992). https://doi.org/10.1007/978-3-662-12616-5

  10. Kushner, H., Yin, G.: Stochastic Approximation and Recursive Algorithms and Applications. Stochastic Modelling and Applied Probability, vol. 35, 2nd edn. Springer (2003)

  11. Li, Q., Tai, C., E, W.: Stochastic modified equations and adaptive stochastic gradient algorithms. In: Proceedings of the 34th ICML (2017)

  12. Mandt, S., Hoffman, M.D., Blei, D.M.: Stochastic gradient descent as approximate Bayesian inference. J. Mach. Learn. Res. 18, 134:1–134:35 (2017)

  13. Poggio, T., et al.: Theory of deep learning III: explaining the non-overfitting puzzle. arXiv preprint arXiv:1801.00173 (2018)

  14. Sagun, L., Evci, U., Ugur Guney, V., Dauphin, Y., Bottou, L.: Empirical analysis of the Hessian of over-parametrized neural networks. arXiv e-prints (2017)

  15. Shirish Keskar, N., Mudigere, D., Nocedal, J., Smelyanskiy, M., Tang, P.T.P.: On large-batch training for deep learning: generalization gap and sharp minima. arXiv e-prints (2016)

  16. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)

  17. Smith, S., Le, Q.: Understanding generalization and stochastic gradient descent. arXiv preprint arXiv:1710.06451 (2017)

  18. Welling, M., Teh, Y.W.: Bayesian learning via stochastic gradient Langevin dynamics. In: Proceedings of the 28th ICML, pp. 681–688 (2011)

  19. Xiao, H., Rasul, K., Vollgraf, R.: Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv e-prints (2017)

  20. Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O.: Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530 (2016)

  21. Zhu, Z., Wu, J., Yu, B., Wu, L., Ma, J.: The regularization effects of anisotropic noise in stochastic gradient descent. arXiv e-prints (2018)


Acknowledgements

We thank NSERC, Canada Research Chairs, IVADO and CIFAR for funding. SJ was in part supported by Grant No. DI 2014/016644 and ETIUDA stipend No. 2017/24/T/ST6/00487. This project has received funding from the European Union’s Horizon 2020 programme under grant agreement No 732204 and the Swiss State Secretariat for Education, Research and Innovation under contract No. 16.0159.

Author information

Corresponding author

Correspondence to Stanislaw Jastrzębski.


Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Jastrzębski, S. et al. (2018). Width of Minima Reached by Stochastic Gradient Descent is Influenced by Learning Rate to Batch Size Ratio. In: Kůrková, V., Manolopoulos, Y., Hammer, B., Iliadis, L., Maglogiannis, I. (eds) Artificial Neural Networks and Machine Learning – ICANN 2018. ICANN 2018. Lecture Notes in Computer Science, vol. 11141. Springer, Cham. https://doi.org/10.1007/978-3-030-01424-7_39


  • DOI: https://doi.org/10.1007/978-3-030-01424-7_39

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-01423-0

  • Online ISBN: 978-3-030-01424-7

  • eBook Packages: Computer Science, Computer Science (R0)
