Stochastic gradient descent (SGD)-based optimizers play a key role in most deep learning models, yet the learning dynamics of these complex models remain obscure. SGD is the basic tool for optimizing model parameters and has been extended in many derived forms, including SGD with momentum and Nesterov accelerated gradient (NAG). However, the learning dynamics of optimizer parameters have seldom been studied. We propose to understand the model dynamics from the perspective of control theory. We use the status transfer function to approximate the parameter dynamics of different optimizers as first- or second-order control systems, thereby explaining how the optimizer parameters theoretically affect the stability and convergence time of deep learning models, and we verify our findings with numerical experiments.
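To make the first- versus second-order distinction concrete, the following is a minimal sketch and not the derivation used in the article. It assumes a one-dimensional quadratic loss f(θ) = ½·h·θ² with curvature h, and it examines discrete-time poles rather than the article's status transfer function. Under that assumption, plain gradient descent reduces to the first-order recurrence θ[k+1] = (1 − ηh)·θ[k], while heavy-ball momentum reduces to the second-order recurrence θ[k+1] = (1 + μ − ηh)·θ[k] − μ·θ[k−1]. All function names (`sgd_pole`, `momentum_poles`, `is_stable`, `simulate_momentum`) are hypothetical and chosen only for illustration.

```python
import numpy as np


def sgd_pole(lr, h):
    """Pole of the first-order recurrence theta[k+1] = (1 - lr*h) * theta[k]."""
    return 1.0 - lr * h


def momentum_poles(lr, mu, h):
    """Poles of the second-order recurrence produced by heavy-ball momentum.

    Characteristic polynomial: z^2 - (1 + mu - lr*h) * z + mu = 0.
    """
    return np.roots([1.0, -(1.0 + mu - lr * h), mu])


def is_stable(poles):
    """A discrete-time linear system is stable when every pole lies inside the unit circle."""
    return bool(np.all(np.abs(np.atleast_1d(poles)) < 1.0))


def simulate_momentum(lr, mu, h, theta0=1.0, steps=200):
    """Iterate heavy-ball momentum on the assumed quadratic loss 0.5 * h * theta^2."""
    theta, v = theta0, 0.0
    trajectory = [theta]
    for _ in range(steps):
        v = mu * v - lr * h * theta  # velocity update; the gradient is h * theta
        theta = theta + v            # parameter update
        trajectory.append(theta)
    return np.array(trajectory)


if __name__ == "__main__":
    h = 1.0  # assumed curvature of the toy loss
    print("SGD, lr=0.1 stable:", is_stable(sgd_pole(0.1, h)))
    print("SGD, lr=2.5 stable:", is_stable(sgd_pole(2.5, h)))
    print("Momentum, lr=0.1, mu=0.9 stable:", is_stable(momentum_poles(0.1, 0.9, h)))
    print("Momentum, lr=5.0, mu=0.9 stable:", is_stable(momentum_poles(5.0, 0.9, h)))
```

In this toy setting, the pole magnitude governs how quickly the iterates decay, which mirrors the abstract's claim that optimizer parameters determine stability and convergence time. Complex-conjugate poles in the momentum case correspond to the oscillatory, underdamped response familiar from second-order control systems.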
This work was supported by the National Natural Science Foundation of China (Grant Nos. 61933013, U1736211), the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No. XDA22030301), the Natural Science Foundation of Guangdong Province (Grant No. 2019A1515011076), and the Key Project of the Natural Science Foundation of Hubei Province (Grant No. 2018CFA024).
Cite this article
Wu, W., Jing, X., Du, W. et al. Learning dynamics of gradient descent optimization in deep neural networks. Sci. China Inf. Sci. 64, 150102 (2021). https://doi.org/10.1007/s11432-020-3163-0
- learning dynamics
- deep neural networks
- gradient descent
- control model
- transfer function