Abstract
We prove the equivalence of two online learning algorithms, mirror descent and natural gradient descent. Both mirror descent and natural gradient descent are generalizations of online gradient descent when the parameter of interest lies on a non-Euclidean manifold. Natural gradient descent selects the steepest descent direction along a Riemannian manifold by multiplying the standard gradient by the inverse of the metric tensor. Mirror descent induces non-Euclidean structure by solving iterative optimization problems using different proximity functions. In this paper, we prove that mirror descent induced by a Bregman divergence proximity function is equivalent to the natural gradient descent algorithm on the Riemannian manifold in the dual coordinate system. We use techniques from convex analysis and connections between Riemannian manifolds, Bregman divergences and convexity to prove this result. This equivalence between natural gradient descent and mirror descent implies that (1) mirror descent is the steepest descent direction along the Riemannian manifold corresponding to the choice of Bregman divergence and (2) mirror descent with log-likelihood loss applied to parameter estimation in exponential families asymptotically achieves the classical Cramér-Rao lower bound.
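The equivalence the abstract describes can be checked numerically in a simple special case. The sketch below (an illustration, not code from the paper) uses the negative-entropy potential ψ(θ) = Σᵢ θᵢ log θᵢ on the positive orthant: one mirror descent step with the induced Bregman divergence is compared against one natural gradient step in the dual coordinates μ = ∇ψ(θ), where the metric tensor is the Hessian ∇²ψ*(μ). All function and variable names here are illustrative choices.

```python
import numpy as np

# Hedged sketch of the claimed equivalence for psi(theta) = sum_i theta_i log theta_i.
# Dual map: mu = grad psi(theta); inverse map: theta = grad psi*(mu).

def grad_psi(theta):
    # primal -> dual coordinates: mu_i = log(theta_i) + 1
    return np.log(theta) + 1.0

def grad_psi_star(mu):
    # dual -> primal coordinates: theta_i = exp(mu_i - 1)
    return np.exp(mu - 1.0)

def mirror_descent_step(theta, grad_f, eta):
    # theta_{t+1} = grad psi*( grad psi(theta_t) - eta * grad f(theta_t) )
    return grad_psi_star(grad_psi(theta) - eta * grad_f)

def natural_gradient_step(theta, grad_f, eta):
    # Work in dual coordinates mu. The Riemannian metric there is
    # G(mu) = Hessian of psi*(mu) = diag(theta). By the chain rule,
    # grad_mu f = G(mu) @ grad_theta f, so the natural gradient
    # G(mu)^{-1} grad_mu f reduces to the primal gradient grad_theta f.
    mu = grad_psi(theta)
    G = np.diag(grad_psi_star(mu))      # metric tensor in dual coordinates
    grad_mu = G @ grad_f                # gradient transformed to dual coordinates
    mu_new = mu - eta * np.linalg.solve(G, grad_mu)
    return grad_psi_star(mu_new)

theta = np.array([0.2, 0.5, 0.3])
g = np.array([1.0, -0.5, 0.25])         # gradient of some loss at theta
a = mirror_descent_step(theta, g, eta=0.1)
b = natural_gradient_step(theta, g, eta=0.1)
print(np.allclose(a, b))                # the two updates coincide
```

Both updates reduce to the multiplicative rule θ_{t+1,i} = θ_{t,i} exp(−η gᵢ), which is the sense in which mirror descent is steepest descent with respect to the metric induced by the Bregman potential.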
Acknowledgements
GR was partially supported by the NSF under Grant DMS-1127914 to the Statistical and Applied Mathematical Sciences Institute. SM was supported by grants: NIH (Systems Biology): 5P50-GM081883, AFOSR: FA9550-10-1-0436, and NSF CCF-1049290.
Copyright information
© 2015 Springer International Publishing Switzerland
Cite this paper
Raskutti, G., Mukherjee, S. (2015). The Information Geometry of Mirror Descent. In: Nielsen, F., Barbaresco, F. (eds) Geometric Science of Information. GSI 2015. Lecture Notes in Computer Science(), vol 9389. Springer, Cham. https://doi.org/10.1007/978-3-319-25040-3_39
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25039-7
Online ISBN: 978-3-319-25040-3