Abstract
Stochastic Gradient Descent (SGD) is a widely used basic optimizer in the learning algorithms of deep neural networks. However, it uses a fixed step size in every epoch, without considering gradient behaviour when determining that step size. Improved SGD-based optimizers such as AdaGrad, Adam, AdaDelta, RAdam, and RMSProp make the step size adaptive in every epoch. However, these optimizers depend on square roots of exponential moving averages (EMAs) of squared past gradients and/or momentums, and therefore cannot exploit local changes in gradients or momentums. To reduce these limitations, a novel optimizer is presented in this paper in which the step size is adjusted for each parameter based on the changing information between the first- and second-moment estimates (i.e., diffMoment). The experimental results show that diffMoment performs better than the AdaGrad, Adam, AdaDelta, RAdam, and RMSProp optimizers. It is also observed that diffMoment performs uniformly better when training Convolutional Neural Networks (CNNs) with different activation functions.
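To illustrate the idea described above, here is a minimal, hypothetical sketch of an Adam-style update whose per-parameter step size is additionally modulated by the gap between the first- and second-moment estimates. The abstract does not give the paper's exact update rule, so the sigmoid "diff" factor `xi` below is an assumption for illustration only (loosely modeled on how diffGrad-style friction coefficients are built), not the authors' method:

```python
import math

def diffmoment_sketch(grad_fn, x0, lr=0.1, beta1=0.9, beta2=0.999,
                      eps=1e-8, steps=200):
    """Adam-like update whose step is additionally scaled by the change
    between the two moment estimates (illustrative sketch only)."""
    x, m, v = float(x0), 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g        # 1st moment: EMA of gradients
        v = beta2 * v + (1 - beta2) * g * g    # 2nd moment: EMA of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias-corrected estimates
        v_hat = v / (1 - beta2 ** t)
        # Hypothetical "diff" factor: a sigmoid of the gap between the
        # bias-corrected moment estimates, damping the step per parameter.
        xi = 1.0 / (1.0 + math.exp(-abs(m_hat - v_hat)))
        x -= lr * xi * m_hat / (math.sqrt(v_hat) + eps)
    return x

# Usage: minimize f(x) = x^2 (gradient 2x) starting from x = 3
x_min = diffmoment_sketch(lambda x: 2 * x, 3.0)
```

Because `xi` lies in (0.5, 1), the sketch never takes a larger step than plain Adam would; it only shrinks the step where the moment estimates agree.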
Acknowledgements
We would like to thank the Department of Computer Science, Vidyasagar University, Paschim Medinipur, for providing the infrastructure.
Ethics declarations
Conflict of interest
The authors have no potential conflicts of interest.
About this article
Cite this article
Bhakta, S., Nandi, U., Si, T. et al. DiffMoment: an adaptive optimization technique for convolutional neural network. Appl Intell 53, 16844–16858 (2023). https://doi.org/10.1007/s10489-022-04382-7