Natural Conjugate Gradient Training of Multilayer Perceptrons

  • Ana González
  • José R. Dorronsoro
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4131)


For maximum likelihood estimation, the Fisher matrix defines a Riemannian metric in weight space and, as shown by Amari and his coworkers, the resulting natural gradient greatly accelerates on-line multilayer perceptron (MLP) training. While its batch gradient descent counterpart also improves on standard gradient descent (it amounts to a Gauss–Newton approximation to mean square error minimization), it may no longer be competitive with more advanced gradient-based function minimization procedures. In this work we show how to introduce natural gradients in a conjugate gradient (CG) setting and demonstrate numerically that, when applied to batch MLP learning, they lead to faster convergence to better minima than those reached by standard Euclidean CG descent. Since a drawback of the full natural gradient is its larger computational cost, we also consider some cost-reducing variants and show that one of them, diagonal natural CG, also gives better minima than standard CG at comparable complexity.
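
As an illustration of the diagonal variant described above (a minimal sketch under assumed interfaces, not the authors' exact algorithm), the NumPy snippet below builds a diagonal Fisher approximation from per-pattern gradients and uses it to precondition a Polak-Ribière CG direction; all function and variable names are hypothetical.

    import numpy as np

    def diag_fisher(per_sample_grads, eps=1e-8):
        # Diagonal Fisher approximation: average of squared per-pattern
        # gradient (score) components; eps keeps the diagonal strictly positive.
        return np.mean(per_sample_grads ** 2, axis=0) + eps

    def natural_cg_direction(grad, per_sample_grads, state=None):
        # One preconditioned Polak-Ribiere CG direction, using the diagonal
        # natural gradient z = diag(G)^{-1} grad as the preconditioned residual.
        #   grad             -- batch error gradient, shape (n_weights,)
        #   per_sample_grads -- per-pattern gradients, shape (n_patterns, n_weights)
        #   state            -- (previous grad, previous z, previous direction) or None
        z = grad / diag_fisher(per_sample_grads)
        if state is None:
            d = -z                      # first step: natural steepest descent
        else:
            g_prev, z_prev, d_prev = state
            beta = max(0.0, grad @ (z - z_prev) / (g_prev @ z_prev))  # PR+ rule
            d = -z + beta * d_prev
        return d, (grad, z, d)

A training loop would line-search along d (for instance by bracketing plus Brent's method) and feed the next batch gradient back into natural_cg_direction; replacing diag_fisher by the full Fisher matrix and solving G z = grad would correspond to the full natural CG variant, at the higher cost mentioned above.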


Keywords: Conjugate Gradient · Hidden Unit · Natural Gradient · Newton Approximation · Fisher Matrix




References

  1. Amari, S.: Natural Gradient Works Efficiently in Learning. Neural Computation 10, 251–276 (1998)
  2. Amari, S., Nagaoka, H.: Methods of Information Geometry. American Mathematical Society (2000)
  3. Amari, S., Park, H., Fukumizu, K.: Adaptive Method of Realizing Natural Gradient Learning for Multilayer Perceptrons. Neural Computation 12, 1399–1409 (2000)
  4. Duda, R., Hart, P., Stork, D.: Pattern Classification. Wiley, Chichester (2000)
  5. Heskes, T.: On Natural Learning and Pruning in Multilayered Perceptrons. Neural Computation 12, 1037–1057 (2000)
  6. Igel, C., Toussaint, M., Weishui, W.: Rprop Using the Natural Gradient. In: Trends and Applications in Constructive Approximation. International Series of Numerical Mathematics, vol. 151. Birkhäuser, Basel (2005)
  7. LeCun, Y., Bottou, L., Orr, G., Müller, K.R.: Efficient BackProp. In: Neural Networks: Tricks of the Trade, pp. 9–50. Springer, Heidelberg (1998)
  8. Murray, M., Rice, J.W.: Differential Geometry and Statistics. Chapman & Hall, Boca Raton (1993)
  9. Murphy, P., Aha, D.: UCI Repository of Machine Learning Databases. Tech. Report, University of California, Irvine (1994)
  10. Polak, E.: Computational Methods in Optimization. Academic Press, London (1971)
  11. Press, W., Teukolsky, S., Vetterling, W., Flannery, B.: Numerical Recipes in C. Cambridge University Press, New York (1988)
  12. Rao, C.R.: Information and Accuracy Attainable in the Estimation of Statistical Parameters. Bull. Cal. Math. Soc. 37, 81–91 (1945)
  13. Rattray, M., Saad, D., Amari, S.: Natural Gradient Descent for On-line Learning. Physical Review Letters 81, 5461–5464 (1998)
  14. Yang, H., Amari, S.: Complexity Issues in Natural Gradient Descent Method for Training Multi-Layer Perceptrons. Neural Computation 10, 2137–2157 (1998)

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Ana González (1)
  • José R. Dorronsoro (1)
  1. Dpto. de Ingeniería Informática and Instituto de Ingeniería del Conocimiento, Universidad Autónoma de Madrid, Madrid, Spain
