Abstract
Modern neural networks can easily fit their training set perfectly. Surprisingly, despite being “overfit” in this way, they tend to generalize well to future data, thereby defying the classic bias–variance trade-off of machine learning theory. Of the many possible explanations, a prevalent one is that training by stochastic gradient descent (SGD) imposes an implicit bias that leads it to learn simple functions, and these simple functions generalize well. However, the specifics of this implicit bias are not well understood.
In this work, we explore the smoothness conjecture which states that SGD is implicitly biased towards learning functions that are smooth. We propose several measures to formalize the intuitive notion of smoothness, and we conduct experiments to determine whether SGD indeed implicitly optimizes for these measures. Our findings rule out the possibility that smoothness measures based on first-order derivatives are being implicitly enforced. They are supportive, though, of the smoothness conjecture for measures based on second-order derivatives.
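The abstract only names the two families of measures; their precise definitions appear in the full paper. As a purely illustrative sketch (not the paper's actual definitions), a first-order smoothness measure could be the mean input-gradient norm of the learned function over the data, and a second-order measure could be the mean magnitude of its second derivatives. The NumPy snippet below estimates both with finite differences for an arbitrary scalar function; the function and the sample points are hypothetical stand-ins for a trained network and its training inputs.

import numpy as np

def first_order_smoothness(f, X, eps=1e-4):
    """Mean input-gradient norm of f over the rows of X,
    estimated with central finite differences (illustrative only)."""
    d = X.shape[1]
    norms = []
    for x in X:
        grad = np.zeros(d)
        for i in range(d):
            e = np.zeros(d)
            e[i] = eps
            grad[i] = (f(x + e) - f(x - e)) / (2 * eps)
        norms.append(np.linalg.norm(grad))
    return float(np.mean(norms))

def second_order_smoothness(f, X, eps=1e-3):
    """Mean norm of the diagonal second derivatives of f over the rows of X,
    estimated with second-order central differences (illustrative only)."""
    d = X.shape[1]
    norms = []
    for x in X:
        diag = np.zeros(d)
        for i in range(d):
            e = np.zeros(d)
            e[i] = eps
            diag[i] = (f(x + e) - 2 * f(x) + f(x - e)) / eps ** 2
        norms.append(np.linalg.norm(diag))
    return float(np.mean(norms))

if __name__ == "__main__":
    # Toy check: a smooth function should score lower than a highly oscillatory one.
    rng = np.random.default_rng(0)
    X = rng.uniform(-1.0, 1.0, size=(100, 2))
    smooth_f = lambda x: float(np.sum(x ** 2))
    wiggly_f = lambda x: float(np.sin(20 * x[0]) + np.cos(20 * x[1]))
    print(first_order_smoothness(smooth_f, X), first_order_smoothness(wiggly_f, X))
    print(second_order_smoothness(smooth_f, X), second_order_smoothness(wiggly_f, X))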
Cite this paper
Volhejn, V., Lampert, C. (2021). Does SGD Implicitly Optimize for Smoothness? In: Akata, Z., Geiger, A., Sattler, T. (eds) Pattern Recognition. DAGM GCPR 2020. Lecture Notes in Computer Science, vol. 12544. Springer, Cham. https://doi.org/10.1007/978-3-030-71278-5_18
Print ISBN: 978-3-030-71277-8
Online ISBN: 978-3-030-71278-5