Abstract
Incorporating second-order curvature information into gradient-based methods has been shown to improve convergence drastically, despite the added computational cost. In this paper, we propose a stochastic (online) quasi-Newton method with Nesterov's accelerated gradient, in both its full and limited-memory forms, for solving large-scale non-convex optimization problems in neural networks. The performance of the proposed algorithm is evaluated in Tensorflow on benchmark classification and regression problems. The results show improved performance compared to the classical second-order oBFGS and oLBFGS methods and to popular first-order stochastic methods such as SGD and Adam. The effect of different momentum rates and batch sizes is also illustrated.
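For illustration, below is a minimal NumPy sketch of one step of a stochastic limited-memory quasi-Newton update that evaluates the minibatch gradient at a Nesterov lookahead point, in the spirit of the method the abstract describes. It is not the paper's implementation: the function names (onaq_step, two_loop_recursion), the hyperparameter defaults, and the curvature-pair skip rule are assumptions made for this sketch.

```python
import numpy as np

def two_loop_recursion(grad, s_list, y_list):
    """Standard L-BFGS two-loop recursion: returns an approximation of
    H^{-1} @ grad built from the stored curvature pairs (s, y)."""
    q = grad.copy()
    alphas = []
    for s, y in zip(reversed(s_list), reversed(y_list)):  # newest pair first
        alpha = (s @ q) / (y @ s)
        q -= alpha * y
        alphas.append(alpha)
    if s_list:  # initial Hessian scaling from the most recent pair
        gamma = (s_list[-1] @ y_list[-1]) / (y_list[-1] @ y_list[-1])
        q *= gamma
    for (s, y), alpha in zip(zip(s_list, y_list), reversed(alphas)):  # oldest first
        beta = (y @ q) / (y @ s)
        q += (alpha - beta) * s
    return q

def onaq_step(w, v, grad_fn, s_list, y_list, mu=0.9, lr=0.1, m=8, eps=1e-10):
    """One stochastic Nesterov-accelerated L-BFGS step (illustrative only).
    grad_fn evaluates the minibatch gradient at a given parameter vector."""
    g_ahead = grad_fn(w + mu * v)                       # gradient at lookahead point
    d = -two_loop_recursion(g_ahead, s_list, y_list)    # quasi-Newton direction
    v_new = mu * v + lr * d                             # accelerated velocity update
    w_new = w + v_new
    # Curvature pair measured between the new point and the lookahead point.
    s = w_new - (w + mu * v)
    y = grad_fn(w_new) - g_ahead
    if s @ y > eps * (y @ y):      # assumed skip rule: keep only safely positive pairs
        s_list.append(s)
        y_list.append(y)
        if len(s_list) > m:        # limited memory: discard the oldest pair
            s_list.pop(0)
            y_list.pop(0)
    return w_new, v_new
```

In stochastic settings, both gradients of the curvature pair are typically computed on the same minibatch (as in the oBFGS scheme of Schraudolph et al.) so that y reflects curvature rather than sampling noise; the sketch above assumes grad_fn does this for one call pair.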