A fast adaptive algorithm for training deep neural networks

Abstract

Among adaptive algorithms, Adam is the most widely used, especially for training deep neural networks. However, recent studies have shown that it generalizes poorly and can even fail to converge in extreme cases. AdaX (2020) is a variant of Adam that modifies Adam's second moment, giving the algorithm good generalization ability comparable to that of SGD. This work aims to improve the AdaX algorithm with faster convergence and higher training accuracy. The first moment of AdaX is essentially a classical momentum term, whereas Nesterov's accelerated gradient (NAG) is theoretically and experimentally superior to classical momentum. We therefore replace the classical momentum term in the first moment of AdaX with NAG and name the resulting algorithm Nesterov's accelerated AdaX (Nadax). Extensive experiments on deep learning tasks show that training models with the proposed Nadax brings favorable benefits.
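
The sketch below is a minimal NumPy illustration of one Nadax-style parameter update, reconstructed only from the description above: an AdaX-style second moment with exponential long-term memory combined with a NAdam-like Nesterov lookahead in place of the classical first-moment term. The function name nadax_step, the hyperparameter defaults, and the exact bias correction are illustrative assumptions; the paper's pseudocode gives the precise recursions.

```python
import numpy as np

def nadax_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=1e-4, eps=1e-8):
    """One illustrative Nadax-style update (a sketch, not the paper's exact rule).

    theta : parameter array        grad : gradient at theta
    m, v  : first/second moment buffers   t : step count, starting at 1
    """
    # First moment: exponential moving average of gradients (as in Adam/AdaX).
    m = beta1 * m + (1.0 - beta1) * grad
    # Second moment with AdaX-style exponential long-term memory:
    # past squared gradients are accumulated rather than discounted away.
    v = (1.0 + beta2) * v + beta2 * grad ** 2
    v_hat = v / ((1.0 + beta2) ** t - 1.0)  # AdaX-style normalization
    # Nesterov-style lookahead on the first moment (NAdam-like substitution
    # for the classical momentum term).
    m_nes = beta1 * m + (1.0 - beta1) * grad
    theta = theta - lr * m_nes / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimize f(x) = 0.5 * ||x||^2, whose gradient at x is x.
theta = np.array([1.0, -2.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 201):
    theta, m, v = nadax_step(theta, theta.copy(), m, v, t)
```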

References

  1. Li W, Zhang Z, Wang X, Luo P (2020) AdaX: adaptive gradient descent with exponential long term memory. arXiv:2004.09740

  2. Sharma N, Jain V, Mishra A (2018) An analysis of convolutional neural networks for image classification. Procedia Comput Sci 132:377–384

  3. Zhao W, Lou M, Qi Y, Wang Y, Xu C, Deng X, Ma Y (2021) Adaptive channel and multiscale spatial context network for breast mass segmentation in full-field mammograms. Appl Intell 51(12):8810–8827

  4. Tian P, Mo H, Jiang L (2021) Scene graph generation by multi-level semantic tasks. Appl Intell 51(11):7781–7793

  5. Gupta AK, Gupta P, Rahtu E (2021) FATALRead: fooling visual speech recognition models

  6. Robbins H, Monro S (1951) A stochastic approximation method. Ann Math Stat 22(3):400–407

  7. Nesterov Y (1983) A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). Doklady AN USSR 269:543–547

  8. Sutskever I, Martens J, Dahl G, Hinton G (2013) On the importance of initialization and momentum in deep learning. In: International conference on machine learning, pp 1139–1147. PMLR

  9. Duchi J, Hazan E, Singer Y (2011) Adaptive subgradient methods for online learning and stochastic optimization. J Mach Learn Res 12:2121–2159

  10. Zeiler MD (2012) ADADELTA: an adaptive learning rate method. arXiv:1212.5701

  11. Tieleman T, Hinton G (2012) Lecture 6.5 - RMSProp: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning 4(2):26–31

  12. Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. arXiv:1412.6980

  13. Dozat T (2016) Incorporating Nesterov momentum into Adam. In: ICLR workshop

  14. Reddi SJ, Kale S, Kumar S (2019) On the convergence of Adam and beyond. arXiv:1904.09237

  15. Wilson AC, Roelofs R, Stern M, Srebro N, Recht B (2017) The marginal value of adaptive gradient methods in machine learning. Adv Neural Inf Process Syst 30

  16. Luo L, Xiong Y, Liu Y, Sun X (2019) Adaptive gradient methods with dynamic bound of learning rate. arXiv:1902.09843

  17. Polyak BT (1964) Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics 4(5):1–17

  18. Zhuang J, Tang T, Ding Y, Tatikonda SC, Dvornek N, Papademetris X, Duncan J (2020) AdaBelief optimizer: adapting stepsizes by the belief in observed gradients. Adv Neural Inf Process Syst 33:18795–18806

  19. Hazan E (2019) Introduction to online convex optimization. arXiv:1909.05207

  20. Zinkevich M (2003) Online convex programming and generalized infinitesimal gradient ascent. In: Proceedings of the 20th international conference on machine learning (ICML-03), pp 928–936

  21. LeCun Y (1998) The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/

  22. Loshchilov I, Hutter F (2017) Decoupled weight decay regularization. arXiv:1711.05101

  23. Xiao H, Rasul K, Vollgraf R (2017) Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747

  24. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778

  25. Krizhevsky A, Hinton G (2009) Learning multiple layers of features from tiny images. Technical report, University of Toronto

  26. Everingham M, Eslami SMA, Van Gool L, Williams CKI, Winn J, Zisserman A (2015) The PASCAL visual object classes challenge: a retrospective. Int J Comput Vis 111(1):98–136

  27. Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3431–3440

  28. Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556

  29. Li H, Xu Z, Taylor G, Studer C, Goldstein T (2018) Visualizing the loss landscape of neural nets. Adv Neural Inf Process Syst 31

Acknowledgements

This work is supported in part by the Natural Science Foundation of China under Grant No. 61472003, by the Academic and Technical Leaders and Backup Candidates of Anhui Province under Grant No. 2019h211, and by the Innovation Team of the '50 Star of Science and Technology' of Huainan, Anhui Province.

Author information

Corresponding author

Correspondence to Dequan Li.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Gui, Y., Li, D. & Fang, R. A fast adaptive algorithm for training deep neural networks. Appl Intell 53, 4099–4108 (2023). https://doi.org/10.1007/s10489-022-03629-7
