Training Deep and Recurrent Networks with Hessian-Free Optimization

  • James Martens
  • Ilya Sutskever
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7700)


In this chapter we first describe the basic HF approach, and then examine well-known performance-improving techniques, such as preconditioning, which we have found to be beneficial for neural network training, as well as others of a more heuristic nature which are harder to justify but which we have found to work well in practice. We also provide practical tips for creating efficient and bug-free implementations, and discuss various pitfalls which may arise when designing and using an HF-type approach in a particular application.
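To make the basic HF approach concrete, the following is a minimal sketch of a single Hessian-free (truncated-Newton) step: the damped quadratic model is minimized with conjugate gradient, using only curvature-vector products rather than an explicit curvature matrix. This is not the authors' implementation; for simplicity it approximates the curvature-vector product by finite differences of the gradient, whereas the chapter's approach uses exact Gauss-Newton vector products. The function names and the fixed damping value are illustrative assumptions.

```python
import numpy as np

def hessian_free_step(grad_fn, theta, max_cg_iters=50, damping=1.0, eps=1e-6):
    """One HF step: approximately solve (B + damping*I) d = -g with CG,
    where the curvature-vector product B*v is approximated by a
    finite difference of gradients (a cheap stand-in for the exact
    R-operator / Gauss-Newton products discussed in the chapter)."""
    g = grad_fn(theta)

    def curv_vec(v):
        # (grad(theta + eps*v) - grad(theta)) / eps  ~=  B v
        return (grad_fn(theta + eps * v) - g) / eps + damping * v

    # Standard conjugate gradient on the damped quadratic model.
    d = np.zeros_like(theta)
    r = -g.copy()          # residual b - A d, with b = -g and d = 0
    p = r.copy()
    rs = r @ r
    for _ in range(max_cg_iters):
        Ap = curv_vec(p)
        alpha = rs / (p @ Ap)
        d += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < 1e-10:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return d

# Toy usage on a quadratic f(theta) = 0.5 theta^T A theta - b^T theta,
# where one undamped HF step recovers the exact Newton step A^{-1} b:
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad = lambda th: A @ th - b
theta = hessian_free_step(grad, np.zeros(2), damping=0.0)
```

In a real neural network setting, `grad_fn` would be the mini-batch gradient, the finite-difference product would be replaced by an exact Gauss-Newton vector product, and the damping constant would be adapted (e.g. by a Levenberg-Marquardt rule) rather than held fixed.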


Keywords: Trust Region · Recurrent Neural Network · Hidden State · Krylov Subspace · Quadratic Approximation
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.




Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • James Martens (1)
  • Ilya Sutskever (1)
  1. Department of Computer Science, University of Toronto, Canada