Natural Conjugate Gradient Training of Multilayer Perceptrons
For maximum log-likelihood estimation, the Fisher matrix defines a Riemannian metric in weight space and, as shown by Amari and his coworkers, the resulting natural gradient greatly accelerates on-line multilayer perceptron (MLP) training. While its batch gradient descent counterpart also improves on standard gradient descent (as it gives a Gauss-Newton approximation to mean square error minimization), it may no longer be competitive with more advanced gradient-based function minimization procedures. In this work we show how to introduce natural gradients in a conjugate gradient (CG) setting, demonstrating numerically that, when applied to batch MLP learning, they lead to faster convergence to better minima than standard Euclidean CG descent achieves. Since a drawback of the full natural gradient is its larger computational cost, we also consider some cost-reducing variants and show that one of them, diagonal natural CG, also gives better minima than standard CG at a comparable complexity.
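As a rough illustration of the diagonal variant mentioned above, the following NumPy sketch rescales the Euclidean gradient by an inverse diagonal approximation of the empirical Fisher matrix; the function name, the use of per-sample gradients to estimate the Fisher diagonal, and all parameters are illustrative assumptions, not the paper's actual algorithm (which embeds such directions in a conjugate gradient scheme).

```python
import numpy as np

def diagonal_natural_gradient_step(w, grad, per_sample_grads, lr=0.1, eps=1e-8):
    """Hypothetical single diagonal natural-gradient step.

    The diagonal of the empirical Fisher matrix is approximated by the
    mean of the squared per-sample gradients; the Euclidean gradient is
    then rescaled by its inverse, avoiding the O(n^2) storage and
    inversion cost of the full Fisher matrix.
    """
    fisher_diag = np.mean(per_sample_grads ** 2, axis=0)  # estimate of diag(F)
    nat_grad = grad / (fisher_diag + eps)                 # F^{-1} g with diagonal F
    return w - lr * nat_grad
```

In a natural CG scheme, a direction of this kind would replace the plain gradient when building the conjugate search directions, trading the exact Riemannian metric for a cost comparable to standard CG.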
Keywords: Conjugate Gradient · Hidden Unit · Natural Gradient · Newton Approximation · Fisher Matrix
- 2. Amari, S., Nagaoka, H.: Methods of Information Geometry. American Mathematical Society (2000)
- 4. Duda, R., Hart, P., Stork, D.: Pattern Classification. Wiley, Chichester (2000)
- 6. Igel, C., Toussaint, M., Weishui, W.: Rprop Using the Natural Gradient. In: Trends and Applications in Constructive Approximation. International Series of Numerical Mathematics, vol. 151. Birkhäuser, Basel (2005)
- 9. Murphy, P., Aha, D.: UCI Repository of Machine Learning Databases. Tech. Report, University of California, Irvine (1994)
- 10. Polak, E.: Computational Methods in Optimization. Academic Press, London (1971)