Abstract
Stochastic Gradient Descent (SGD) is a widely used basic optimizer in the learning algorithms of deep neural networks. However, it uses a fixed step size in every epoch, without considering gradient behaviour when determining that step size. Improved SGD-based optimizers such as AdaGrad, Adam, AdaDelta, RAdam, and RMSProp make the step size adaptive in every epoch. However, these optimizers depend on square roots of exponential moving averages (EMAs) of squared past gradients and/or momentums, and therefore cannot exploit local changes in gradients or momentums. To reduce these limitations, a novel optimizer is presented in this paper in which the step size is adjusted for each parameter based on the changing information between the first- and second-moment estimates (i.e., diffMoment). The experimental results show that diffMoment performs better than the AdaGrad, Adam, AdaDelta, RAdam, and RMSProp optimizers. It is also observed that diffMoment performs uniformly better when training Convolutional Neural Networks (CNNs) with different activation functions.
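To illustrate the idea described above, here is a minimal, hypothetical sketch of an Adam-style update whose per-parameter step size is additionally modulated by the gap between the first- and second-moment estimates. The abstract does not give the paper's exact update rule, so the sigmoid "diff" factor `xi` below is an assumption for illustration only (loosely modeled on how diffGrad-style friction coefficients are built), not the authors' method:

```python
import math

def diffmoment_sketch(grad_fn, x0, lr=0.1, beta1=0.9, beta2=0.999,
                      eps=1e-8, steps=200):
    """Adam-like update whose step is additionally scaled by the change
    between the two moment estimates (illustrative sketch only)."""
    x, m, v = float(x0), 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g        # 1st moment: EMA of gradients
        v = beta2 * v + (1 - beta2) * g * g    # 2nd moment: EMA of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias-corrected estimates
        v_hat = v / (1 - beta2 ** t)
        # Hypothetical "diff" factor: a sigmoid of the gap between the
        # bias-corrected moment estimates, damping the step per parameter.
        xi = 1.0 / (1.0 + math.exp(-abs(m_hat - v_hat)))
        x -= lr * xi * m_hat / (math.sqrt(v_hat) + eps)
    return x

# Usage: minimize f(x) = x^2 (gradient 2x) starting from x = 3
x_min = diffmoment_sketch(lambda x: 2 * x, 3.0)
```

Because `xi` lies in (0.5, 1), the sketch never takes a larger step than plain Adam would; it only shrinks the step where the moment estimates agree.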
Acknowledgements
We would like to thank the Department of Computer Science, Vidyasagar University, Paschim Medinipur, for providing the infrastructure.
Ethics declarations
Conflict of interest
The authors have no potential conflicts of interest.
About this article
Cite this article
Bhakta, S., Nandi, U., Si, T. et al. DiffMoment: an adaptive optimization technique for convolutional neural network. Appl Intell 53, 16844–16858 (2023). https://doi.org/10.1007/s10489-022-04382-7