
A Formalization of the Natural Gradient Method for General Similarity Measures

Part of the Lecture Notes in Computer Science book series (LNIP, volume 11712)


In optimization, the natural gradient method is well-known for likelihood maximization. The method uses the Kullback–Leibler (KL) divergence, corresponding infinitesimally to the Fisher–Rao metric, which is pulled back to the parameter space of a family of probability distributions. This way, gradients with respect to the parameters respect the Fisher–Rao geometry of the space of distributions, which might differ vastly from the standard Euclidean geometry of the parameter space, often leading to faster convergence. The concept of natural gradient has in most discussions been restricted to the KL-divergence/Fisher–Rao case, although in information geometry the local \(C^2\) structure of a general divergence has been used for deriving a closely related Riemannian metric analogous to the KL-divergence case. In this work, we wish to cast natural gradients into this more general context and provide example computations, notably in the case of a Finsler metric and the p-Wasserstein metric. We additionally discuss connections between the natural gradient method and multiple other optimization techniques in the literature.
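The core idea the abstract describes — preconditioning the Euclidean gradient with the (inverse) metric tensor pulled back to parameter space — can be illustrated in the classical KL/Fisher–Rao case. The sketch below is not from the paper; it is a standard textbook example, fitting a univariate Gaussian by natural gradient ascent, where the Fisher information matrix is known in closed form.

```python
import numpy as np

# Minimal sketch (standard example, not the paper's method): natural gradient
# ascent on the average log-likelihood of a univariate Gaussian N(mu, sigma^2).
# For this family the Fisher information matrix is diagonal,
#   F(mu, sigma) = diag(1/sigma^2, 2/sigma^2),
# so the natural gradient is simply F^{-1} applied to the Euclidean gradient.

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=1000)

mu, sigma = 0.0, 1.0   # initial parameters
lr = 0.1               # step size

for _ in range(100):
    # Euclidean gradient of the average log-likelihood
    g_mu = np.mean(data - mu) / sigma**2
    g_sigma = np.mean((data - mu) ** 2 - sigma**2) / sigma**3

    # Precondition with F^{-1} = diag(sigma^2, sigma^2 / 2)
    nat_mu = sigma**2 * g_mu
    nat_sigma = 0.5 * sigma**2 * g_sigma

    mu += lr * nat_mu
    sigma += lr * nat_sigma

print(mu, sigma)  # should approach the sample mean and standard deviation
```

Note that the natural update for `mu` reduces to `lr * (mean(data) - mu)`: the `1/sigma**2` scaling of the Euclidean gradient is exactly cancelled by the metric, which is the invariance-to-parametrization property the general framework in this paper extends beyond the Fisher–Rao case.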


Keywords

  • Optimization
  • Natural gradient
  • Statistical manifolds





Acknowledgements

The authors were supported by the Centre for Stochastic Geometry and Advanced Bioimaging and a block stipendium, both funded by a grant from the Villum Foundation. We furthermore wish to thank the anonymous reviewers for their very useful comments.

Author information

Correspondence to Anton Mallasto.


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Mallasto, A., Haije, T.D., Feragen, A. (2019). A Formalization of the Natural Gradient Method for General Similarity Measures. In: Nielsen, F., Barbaresco, F. (eds) Geometric Science of Information. GSI 2019. Lecture Notes in Computer Science, vol 11712. Springer, Cham.


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-26979-1

  • Online ISBN: 978-3-030-26980-7

  • eBook Packages: Computer Science (R0)