
Empirically Explaining SGD from a Line Search Perspective

  • Conference paper
  • In: Artificial Neural Networks and Machine Learning – ICANN 2021 (ICANN 2021)
  • Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12892)


Abstract

Optimization in Deep Learning is mainly guided by vague intuitions and strong assumptions, with only a limited understanding of how and why they work in practice. To shed more light on this, our work provides a deeper understanding of how SGD behaves by empirically analyzing the trajectory taken by SGD from a line search perspective. Specifically, we perform a costly quantitative analysis of the full-batch loss along SGD trajectories of commonly used models trained on a subset of CIFAR-10. Our core results include that the full-batch loss along lines in the update step direction is highly parabolic. Furthermore, we show that there exists a learning rate with which SGD always performs almost exact line searches on the full-batch loss. Finally, we provide a different perspective on why increasing the batch size has almost the same effect as decreasing the learning rate by the same factor.
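To make the measurement behind these claims concrete, the following is a minimal sketch (not the authors' released code) of how one could sample the full-batch loss along the line spanned by a single SGD update direction and fit a parabola to it; the vertex of the fit gives the locally optimal step size against which the actual SGD step can be compared. The toy model, synthetic data, mini-batch size, and sampling range are illustrative assumptions chosen so the sketch is self-contained; the paper's experiments instead use a ResNet-20 trained on a subset of CIFAR-10 in PyTorch [20].

```python
# Sketch: sample the full-batch loss along one SGD update direction and fit a parabola.
# A tiny MLP on synthetic data stands in for the paper's ResNet-20 / CIFAR-10 setup;
# all model, data, and range choices here are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(512, 20)                      # stand-in "full batch"
y = torch.randint(0, 10, (512,))
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 10))
loss_fn = nn.CrossEntropyLoss()

def full_batch_loss():
    with torch.no_grad():
        return loss_fn(model(X), y).item()

# Direction of one SGD update: the negative mini-batch gradient, so plain SGD with
# learning rate lambda corresponds to the point t = lambda on this line.
xb, yb = X[:32], y[:32]                       # stand-in mini-batch
model.zero_grad()
loss_fn(model(xb), yb).backward()
direction = [-p.grad.detach().clone() for p in model.parameters()]
theta0 = [p.detach().clone() for p in model.parameters()]

def move_to(t):
    """Set the parameters to theta0 + t * direction."""
    with torch.no_grad():
        for p, p0, d in zip(model.parameters(), theta0, direction):
            p.copy_(p0 + t * d)

# Sample the full-batch loss at several step sizes t along the line.
ts = torch.linspace(0.0, 0.5, 21)
losses = []
for t in ts:
    move_to(t.item())
    losses.append(full_batch_loss())
losses = torch.tensor(losses)

# Least-squares fit of a parabola a*t^2 + b*t + c; its vertex -b/(2a) is the
# locally optimal step size along this line, to compare against the SGD step.
A = torch.stack([ts**2, ts, torch.ones_like(ts)], dim=1)
a, b, c = torch.linalg.lstsq(A, losses.unsqueeze(1)).solution.squeeze().tolist()
t_opt = -b / (2 * a)
print(f"parabola fit a={a:.4f} b={b:.4f} c={c:.4f}, locally optimal step t*={t_opt:.4f}")
move_to(0.0)                                  # restore the original parameters
```

Repeating this along an actual training trajectory (one fit per update step) yields the kind of data the paper analyzes, e.g. how closely the SGD step size tracks the parabola's vertex.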


Notes

  1. Better performance does not imply that the assumptions used are correct.

  2. Image classification on MNIST, SVHN, CIFAR-10, CIFAR-100 and ImageNet.

  3. See the GitHub link in Sect. 7 for further analyses and code. We are aware that our analysis of a small set of problems provides limited evidence; nevertheless, we consider it to be guiding. With the code published with this paper, it is simple to run our experiments on further problems.

  4. Cropping, horizontal flipping and normalization with mean and standard deviation (a sketch of such a pipeline follows after these notes).

  5. Best-performing \(\lambda \) chosen by a grid search over \(\{10^{-i}\,|\, i \in \{0,1,1.3,2,3,4\}\}\) (see the sketch after these notes).

  6. Note that we have done the same evaluation for a ResNet-18 [8] and a MobileNetV2 [24] trained on the same data and obtained results supporting our claims. See the GitHub link.
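The sketch referenced in notes 4 and 5 is given below. The normalization statistics and crop padding are common CIFAR-10 defaults rather than values taken from the paper, and \(\lambda \) is assumed here to denote the SGD learning rate.

```python
# Sketch of the augmentation from note 4 and the grid from note 5.
# Assumptions: mean/std and crop padding are common CIFAR-10 defaults (not from the
# paper), and lambda is taken to be the SGD learning rate.
import torchvision.transforms as T

train_transform = T.Compose([
    T.RandomCrop(32, padding=4),            # cropping
    T.RandomHorizontalFlip(),               # horizontal flipping
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465),   # per-channel mean
                (0.2470, 0.2435, 0.2616)),  # per-channel standard deviation
])

# Grid over lambda = 10^{-i} for i in {0, 1, 1.3, 2, 3, 4}.
lambda_grid = [10.0 ** -i for i in (0, 1, 1.3, 2, 3, 4)]
print(lambda_grid)  # [1.0, 0.1, ~0.05, 0.01, 0.001, 0.0001]
```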

References

  1. Berrada, L., Zisserman, A., Kumar, M.P.: Training neural networks for and by interpolation. In: ICML (2020)
  2. Chae, Y., Wilke, D.N.: Empirical study towards understanding line search approximations for training neural networks. arXiv (2019)
  3. De, S., Yadav, A.K., Jacobs, D.W., Goldstein, T.: Big batch SGD: automated inference using adaptive batch sizes. arXiv (2016)
  4. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: CVPR (2009)
  5. Draxler, F., Veschgini, K., Salmhofer, M., Hamprecht, F.A.: Essentially no barriers in neural network energy landscape. In: ICML (2018)
  6. Fort, S., Jastrzebski, S.: Large scale structure of neural network loss landscapes. In: NeurIPS (2019)
  7. Goodfellow, I.J., Vinyals, O., Saxe, A.M.: Qualitatively characterizing neural network optimization problems. In: ICLR (2015)
  8. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
  9. Hochreiter, S., Schmidhuber, J.: Simplifying neural nets by discovering flat minima. In: NeurIPS (1994)
  10. Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: CVPR (2017)
  11. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: ICML (2015)
  12. Jastrzebski, S., Kenton, Z., Ballas, N., Fischer, A., Bengio, Y., Storkey, A.J.: On the relation between the sharpest directions of DNN loss and the SGD step length. In: ICLR (2019)
  13. Keskar, N.S., Mudigere, D., Nocedal, J., Smelyanskiy, M., Tang, P.T.P.: On large-batch training for deep learning: generalization gap and sharp minima. In: ICLR (2017)
  14. Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. Technical report, Citeseer (2009)
  15. Li, H., Xu, Z., Taylor, G., Goldstein, T.: Visualizing the loss landscape of neural nets. In: NeurIPS (2018)
  16. Li, X., Gu, Q., Zhou, Y., Chen, T., Banerjee, A.: Hessian based analysis of SGD for deep nets: dynamics and generalization. In: SDM21 (2020)
  17. Mahsereci, M., Hennig, P.: Probabilistic line searches for stochastic optimization. J. Mach. Learn. Res. 18(1), 4262–4320 (2017)
  18. McCandlish, S., Kaplan, J., Amodei, D., Team, O.D.: An empirical model of large-batch training. arXiv (2018)
  19. Mutschler, M., Zell, A.: Parabolic approximation line search for DNNs. In: NeurIPS (2020)
  20. Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learning library. In: NeurIPS (2019)
  21. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22, 400–407 (1951)
  22. Rolinek, M., Martius, G.: L4: practical loss-based stepsize adaptation for deep learning. In: NeurIPS (2018)
  23. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating errors. Nature 323(6088), 533 (1986)
  24. Sandler, M., Howard, A.G., Zhu, M., Zhmoginov, A., Chen, L.C.: MobileNetV2: inverted residuals and linear bottlenecks. In: CVPR (2018)
  25. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: ICLR (2015)
  26. Smith, L.N.: Cyclical learning rates for training neural networks. In: WACV (2017)
  27. Smith, S.L., Kindermans, P., Ying, C., Le, Q.V.: Don't decay the learning rate, increase the batch size. In: ICLR (2018)
  28. Vaswani, S., Mishkin, A., Laradji, I., Schmidt, M., Gidel, G., Lacoste-Julien, S.: Painless stochastic gradient: interpolation, line-search, and convergence rates. In: NeurIPS (2019)
  29. Xing, C., Arpit, D., Tsirigotis, C., Bengio, Y.: A walk with SGD. arXiv (2018)


Author information


Correspondence to Maximus Mutschler or Andreas Zell.


8 Appendix

Fig. 9. SGD training process with momentum 0.9. See Fig. 6 for explanations. The core differences are that, for the proportionality, the noise is higher than in the plain SGD case, and that SGD with momentum overshoots the locally optimal step size less and does not perform as exact a line search.

Fig. 10. SGD with a locally optimal learning rate of 0.05 performs worse than SGD with a globally optimal learning rate of 0.01. Training is performed on a ResNet-20 and 8% of CIFAR-10 with SGD without momentum.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Mutschler, M., Zell, A. (2021). Empirically Explaining SGD from a Line Search Perspective. In: Farkaš, I., Masulli, P., Otte, S., Wermter, S. (eds) Artificial Neural Networks and Machine Learning – ICANN 2021. ICANN 2021. Lecture Notes in Computer Science, vol. 12892. Springer, Cham. https://doi.org/10.1007/978-3-030-86340-1_37


  • DOI: https://doi.org/10.1007/978-3-030-86340-1_37

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86339-5

  • Online ISBN: 978-3-030-86340-1

  • eBook Packages: Computer Science (R0)
