DiffMoment: an adaptive optimization technique for convolutional neural network

Abstract

Stochastic Gradient Descent (SGD) is a very popular basic optimizer used in the learning algorithms of deep neural networks. However, it takes fixed-size steps in every epoch, without considering gradient behaviour when determining the step size. Improved SGD optimizers such as AdaGrad, Adam, AdaDelta, RAdam, and RMSProp make the step size adaptive in every epoch. However, these optimizers rely on the square roots of exponential moving averages (EMA) of squared past gradients and/or momentum, and therefore cannot benefit from local changes in the gradients or the momentum. To reduce these limitations, a novel optimizer is presented in this paper in which the step size is adjusted for each parameter based on the changing information between the first and the second moment estimates (i.e., diffMoment). The experimental results show that diffMoment offers better performance than the AdaGrad, Adam, AdaDelta, RAdam, and RMSProp optimizers. It is also observed that diffMoment performs uniformly better when training Convolutional Neural Networks (CNNs) with different activation functions.
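
To make the idea above concrete, the sketch below shows a hypothetical Adam-style parameter update in which the per-parameter step is scaled by a friction term derived from the difference between the first and second moment estimates. This is only an illustration of the principle stated in the abstract, not the authors' exact diffMoment update rule: the sigmoid friction term xi, the hyperparameter values, and the function name diffmoment_like_update are assumptions made for this sketch.

    # Illustrative sketch only: an Adam-style update whose per-parameter step is
    # scaled by a term derived from the difference between the first and second
    # moment estimates. The exact diffMoment rule is NOT reproduced here; the
    # sigmoid friction term below is an assumption made for illustration.
    import numpy as np

    def diffmoment_like_update(theta, grad, state, lr=1e-3,
                               beta1=0.9, beta2=0.999, eps=1e-8):
        """One hypothetical update step for parameters `theta` given `grad`."""
        m, v, t = state["m"], state["v"], state["t"] + 1

        # Standard EMA-based moment estimates, as in Adam.
        m = beta1 * m + (1.0 - beta1) * grad
        v = beta2 * v + (1.0 - beta2) * grad ** 2

        # Bias-corrected moment estimates.
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)

        # Assumed friction term: a sigmoid of the element-wise difference between
        # the first and second moment estimates, so the effective step size of
        # each parameter reacts to local changes in its moment estimates.
        xi = 1.0 / (1.0 + np.exp(-np.abs(m_hat - v_hat)))

        theta = theta - lr * xi * m_hat / (np.sqrt(v_hat) + eps)
        state.update(m=m, v=v, t=t)
        return theta, state

    # Usage on a toy quadratic objective f(theta) = ||theta||^2 / 2.
    theta = np.array([1.0, -2.0])
    state = {"m": np.zeros_like(theta), "v": np.zeros_like(theta), "t": 0}
    for _ in range(1000):
        grad = theta  # gradient of the toy objective
        theta, state = diffmoment_like_update(theta, grad, state, lr=1e-2)
    print(theta)  # approaches the origin

In this illustration, a parameter whose first and second moment estimates differ strongly receives a friction coefficient close to 1 (nearly a full Adam-like step), while a parameter whose estimates barely differ is damped towards a half-size step; the coefficient actually used by diffMoment may differ.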

Acknowledgements

We would like to thank the Department of Computer Science, Vidyasagar University, Paschim Medinipur, for providing infrastructure.

Author information

Corresponding author

Correspondence to Utpal Nandi.

Ethics declarations

Conflict of Interests

The authors have no potential conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Bhakta, S., Nandi, U., Si, T. et al. DiffMoment: an adaptive optimization technique for convolutional neural network. Appl Intell 53, 16844–16858 (2023). https://doi.org/10.1007/s10489-022-04382-7
