Abstract
Stochastic optimizers such as Momentum, AdaGrad, and AdaDelta require hyper-parameters to be chosen, and in many cases these hyper-parameters are tuned tediously by experience, making the process more of an art than a science. We introduce AdaSmooth, a novel per-dimension learning rate method for stochastic gradient optimization. The method is insensitive to its hyper-parameters and therefore, unlike Momentum, AdaGrad, and AdaDelta, requires no manual tuning. We show promising results compared to other methods on different convolutional neural networks, multi-layer perceptrons, and other machine learning tasks. Empirical results demonstrate that AdaSmooth works well in practice and compares favourably to other stochastic optimization methods for neural networks.
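The paper's title indicates the method is driven by an "effective ratio" in the spirit of Kaufman's efficiency ratio from the technical-trading literature. As a rough, illustrative sketch only — the function name, the endpoint constants `fast`/`slow`, and the bookkeeping below are our assumptions, not the authors' exact AdaSmooth update rule — a per-dimension step driven by such a ratio might look like:

```python
import numpy as np

def adasmooth_like_step(x, grad, state, lr=0.2, eps=1e-6,
                        fast=0.5, slow=0.01):
    """One illustrative per-dimension adaptive step (a sketch, not the
    authors' exact AdaSmooth rule).

    state holds: 'x_prev' (parameters at the start of the window),
    'path' (accumulated absolute parameter movement since then), and
    'E_g2' (running average of squared gradients, per dimension).
    """
    # Kaufman-style efficiency ratio per dimension:
    # |net displacement| / (total path length); lies in [0, 1).
    er = np.abs(x - state['x_prev']) / (state['path'] + eps)
    # Map the ratio to a smoothing constant between a slow and a fast
    # decay; the endpoint values `fast` and `slow` are our assumptions.
    c = (er * (fast - slow) + slow) ** 2
    # Per-dimension exponential moving average of squared gradients,
    # with a data-dependent smoothing constant: steady progress (high
    # ratio) adapts quickly, erratic progress adapts slowly.
    state['E_g2'] = c * grad ** 2 + (1.0 - c) * state['E_g2']
    step = lr * grad / (np.sqrt(state['E_g2']) + eps)
    state['path'] = state['path'] + np.abs(step)
    return x - step, state
```

On a simple quadratic, initializing `E_g2` from the first squared gradient and iterating this update drives the parameters toward the minimum, with the effective ratio steering how fast the squared-gradient average adapts in each dimension.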
Notes
- 1.
The Census Income data set contains 48,842 samples in total; 70% of them are used as the training set in our case: https://archive.ics.uci.edu/ml/datasets/Census+Income.
- 2.
The MNIST data set has a training set of 60,000 examples, and a test set of 10,000 examples.
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Lu, J. (2023). AdaSmooth: An Adaptive Learning Rate Method Based on Effective Ratio. In: Shakya, S., Du, KL., Ntalianis, K. (eds) Sentiment Analysis and Deep Learning. Advances in Intelligent Systems and Computing, vol 1432. Springer, Singapore. https://doi.org/10.1007/978-981-19-5443-6_21
DOI: https://doi.org/10.1007/978-981-19-5443-6_21
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-5442-9
Online ISBN: 978-981-19-5443-6
eBook Packages: Intelligent Technologies and Robotics (R0)