Learning dynamics of gradient descent optimization in deep neural networks

Abstract

Stochastic gradient descent (SGD)-based optimizers play a key role in most deep learning models, yet the learning dynamics of the complex model remain obscure. SGD is the basic tool to optimize model parameters, and is improved in many derived forms including SGD momentum and Nesterov accelerated gradient (NAG). However, the learning dynamics of optimizer parameters have seldom been studied. We propose to understand the model dynamics from the perspective of control theory. We use the status transfer function to approximate parameter dynamics for different optimizers as the first- or second-order control system, thus explaining how the parameters theoretically affect the stability and convergence time of deep learning models, and verify our findings by numerical experiments.

This is a preview of subscription content, access via your institution.

References

  1. 1

    Ruder S. An overview of gradient descent optimization algorithms. 2016. ArXiv:1609.04747

  2. 2

    An W P, Wang H Q, Sun Q Y, et al. A PID controller approach for stochastic optimization of deep networks. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018. 8522–8531

  3. 3

    Kim D, Kim J, Kwon J, et al. Depth-controllable very deep super-resolution network. In: Proceedings of International Joint Conference on Neural Networks, 2019. 1–8

  4. 4

    Hinton G, Srivastava N, Swersky K. Overview of mini-batch gradient descent. 2012. http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf

  5. 5

    Qian N. On the momentum term in gradient descent learning algorithms. Neural Netw, 1999, 12: 145–151

    Article  Google Scholar 

  6. 6

    Duchi J, Hazan E, Singer Y. Adaptive subgradient methods for online learning and stochastic optimization. J Mach Learn Res, 2011, 12: 2121–2159

    MathSciNet  MATH  Google Scholar 

  7. 7

    Zeiler M D. Adadelta: an adaptive learning rate method. 2012. ArXiv:1212.5701

  8. 8

    Dauphin Y N, de Vries H, Bengio Y. Equilibrated adaptive learning rates for nonconvex optimization. In: Proceedings of Conference and Workshop on Neural Information Processing Systems, 2015

  9. 9

    Kingma D, Ba J. Adam: a method for stochastic optimization. In: Proceedings of International Conference on Learning Representations, 2015. 1–15

  10. 10

    Reddi S J, Kale S, Kumar S. On the convergence of ADAM and beyond. In: Proceedings of International Conference on Learning Representations, 2018. 1–23

  11. 11

    Luo L C, Xiong Y H, Liu Y, et al. Adaptive gradient methods with dynamic bound of learning rate. In: Proceedings of International Conference on Learning Representations, 2019. 1–19

  12. 12

    Saxe A M, McClelland J L, Ganguli S. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. 2013. ArXiv:1312.6120

  13. 13

    Lee T H, Trinh H M, Park J H. Stability analysis of neural networks with time-varying delay by constructing novel Lyapunov functionals. IEEE Trans Neural Netw Learn Syst, 2018, 29: 4238–4247

    Article  Google Scholar 

  14. 14

    Faydasicok O, Arik S. A novel criterion for global asymptotic stability of neutral type neural networks with discrete time delays. In: Proceedings of International Conference on Neural Information Processing, 2018. 353–360

  15. 15

    Vidal R, Bruna J, Giryes R, et al. Mathematics of deep learning. 2017. ArXiv:1712.04741

  16. 16

    Chaudhari P, Oberman A, Osher S, et al. Deep relaxation: partial differential equations for optimizing deep neural networks. Res Math Sci, 2018, 5: 30

    MathSciNet  Article  Google Scholar 

  17. 17

    Wang H Q, Luo Y, An W P, et al. PID controller-based stochastic optimization acceleration for deep neural networks. IEEE Trans Neural Netw Learn Syst, 2020, 31: 5079–5091

    Article  Google Scholar 

  18. 18

    Cousseau F, Ozeki T, Amari S. Dynamics of learning in multilayer perceptrons near singularities. IEEE Trans Neural Netw, 2008, 19: 1313–1328

    Article  Google Scholar 

  19. 19

    Amari S, Park H, Ozeki T. Singularities affect dynamics of learning in neuromanifolds. Neural Comput, 2006, 18: 1007–1065

    MathSciNet  Article  Google Scholar 

  20. 20

    Bietti A, Mairal J. Group invariance, stability to deformations, and complexity of deep convolutional representations. J Mach Learn Res, 2019, 20: 876–924

    MathSciNet  MATH  Google Scholar 

  21. 21

    Sutskever I, Martens J, Dahl G, et al. On the importance of initialization and momentum in deep learning. In: Proceedings of International Conference on Machine Learning, 2013. 1139–1147

  22. 22

    Lecun Y, Bottou L, Bengio Y, et al. Gradient-based learning applied to document recognition. Proc IEEE, 1998, 86: 2278–2324

    Article  Google Scholar 

  23. 23

    Li L S, Jamieson K, DeSalvo G, et al. Hyperband: a novel bandit-based approach to hyperparameter optimization. J Mach Learn Res, 2018, 18: 1–52

    MathSciNet  MATH  Google Scholar 

Download references

Acknowledgements

This work was supported by National Natural Science Foundation of China (Grant Nos. 61933013, U1736211), Strategic Priority Research Program of Chinese Academy of Sciences (Grant No. XDA22030301), Natural Science Foundation of Guangdong Province (Grant No. 2019A1515011076), and Key Project of Natural Science Foundation of Hubei Province (Grant No. 2018CFA024).

Author information

Affiliations

Authors

Corresponding authors

Correspondence to Wei Wu or Xiaoyuan Jing.

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Wu, W., Jing, X., Du, W. et al. Learning dynamics of gradient descent optimization in deep neural networks. Sci. China Inf. Sci. 64, 150102 (2021). https://doi.org/10.1007/s11432-020-3163-0

Download citation

Keywords

  • learning dynamics
  • deep neural networks
  • gradient descent
  • control model
  • transfer function