adaQN: An Adaptive Quasi-Newton Algorithm for Training RNNs

  • Nitish Shirish Keskar
  • Albert S. Berahas
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9851)


Recurrent Neural Networks (RNNs) are powerful models that achieve exceptional performance on a plethora of pattern recognition problems. However, training RNNs is a computationally difficult task owing to the well-known "vanishing/exploding" gradient problem. Algorithms proposed for training RNNs either exploit no (or limited) curvature information and have cheap per-iteration complexity, or attempt to gain significant curvature information at the cost of increased per-iteration expense. The former set includes diagonally scaled first-order methods such as Adagrad and Adam, while the latter consists of second-order algorithms like Hessian-Free Newton and K-FAC. In this paper, we present adaQN, a stochastic quasi-Newton algorithm for training RNNs. Our approach retains a low per-iteration cost while allowing for non-diagonal scaling through a stochastic L-BFGS updating scheme. The method uses a novel L-BFGS scaling initialization and is judicious in storing and retaining L-BFGS curvature pairs. We present numerical experiments on two language modeling tasks and show that adaQN is competitive with popular RNN training algorithms.
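The stochastic L-BFGS updating mentioned in the abstract applies a non-diagonal inverse-Hessian approximation, built from stored curvature pairs (s, y), to the stochastic gradient via the standard two-loop recursion. The sketch below shows that recursion only; the function name is ours, and the fixed scalar `gamma` for the initial scaling H0 = gamma * I is a placeholder for adaQN's adaptive initialization scheme, which, along with its rules for storing and retaining pairs, is specified in the paper itself.

```python
import numpy as np

def lbfgs_direction(grad, s_list, y_list, gamma=1.0):
    """Apply the L-BFGS inverse-Hessian approximation to `grad`
    via the two-loop recursion and return a descent direction.
    s_list/y_list hold curvature pairs, oldest first."""
    q = grad.astype(float).copy()
    rhos = [1.0 / np.dot(y, s) for s, y in zip(s_list, y_list)]
    alphas = []
    # First loop: newest pair to oldest
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        a = rho * np.dot(s, q)
        alphas.append(a)
        q -= a * y
    # Initial scaling H0 = gamma * I (adaQN uses an adaptive scalar here)
    r = gamma * q
    # Second loop: oldest pair to newest (alphas were stored newest-first)
    for s, y, rho, a in zip(s_list, y_list, rhos, reversed(alphas)):
        b = rho * np.dot(y, r)
        r += (a - b) * s
    return -r  # quasi-Newton search direction
```

With an empty pair memory this reduces to scaled steepest descent, `-gamma * grad`; each stored pair then bends the direction toward the Newton step along that pair's subspace.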


Keywords: Fisher Information Matrix · Stochastic Gradient · Stochastic Gradient Descent · Curvature Information · Hessian Approximation



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  1. Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, USA
  2. Department of Engineering Sciences and Applied Mathematics, Northwestern University, Evanston, USA
