Natural gradient via optimal transport

Abstract

We study a natural Wasserstein gradient flow on manifolds of probability distributions with discrete sample spaces. We derive the Riemannian structure for the probability simplex from the dynamical formulation of the Wasserstein distance on a weighted graph. We pull back the geometric structure to the parameter space of any given probability model, which allows us to define a natural gradient flow there. In contrast to the natural Fisher–Rao gradient, the natural Wasserstein gradient incorporates a ground metric on the sample space. We illustrate the analysis with elementary exponential family examples and demonstrate an application of the Wasserstein natural gradient to maximum likelihood estimation.



Notes

  1. A length space is one in which the distance between two points equals the infimum of the lengths of continuous curves joining them.

  2. We use the direct method, a standard technique in optimal control: time is discretized, and the sum replacing the integral is minimized by gradient descent with respect to \((p(t)_i)_{i=1,3, t\in \{t_1,\ldots , t_N\}} \in \mathbb {R}^{2\times N}\). A reference for these techniques is [24].
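The direct method described in the note above can be sketched with a minimal, self-contained example. This is our generic Euclidean stand-in, not the paper's simplex computation: we minimize the time-discretized action \(\sum _k \Vert x_{k+1}-x_k\Vert ^2/(2\Delta t)\) of a path with fixed endpoints by plain gradient descent. The step size and iteration count are arbitrary choices; the minimizer is the uniformly parameterized straight line.

```python
import numpy as np

def discrete_action_grad(x, x0, x1, dt):
    """Gradient of the discretized action sum_k |x_{k+1}-x_k|^2/(2*dt)
    with respect to the interior points x; endpoints x0, x1 are fixed."""
    full = np.vstack([x0, x, x1])       # full path including endpoints
    diff = np.diff(full, axis=0) / dt   # velocity on each sub-interval
    # d/dx_k of the action: (x_k - x_{k-1})/dt - (x_{k+1} - x_k)/dt
    return diff[:-1] - diff[1:]

def direct_method(x0, x1, n_interior=9, n_iters=2000, lr=0.02):
    """Minimize the time-discretized kinetic action by gradient descent."""
    dt = 1.0 / (n_interior + 1)
    # initialize all interior points near the midpoint of the endpoints
    x = np.tile((x0 + x1) / 2, (n_interior, 1)) + 0.01
    for _ in range(n_iters):
        x -= lr * discrete_action_grad(x, x0, x1, dt)
    return x

x0 = np.array([0.0, 0.0])
x1 = np.array([1.0, 2.0])
path = direct_method(x0, x1)
# the minimizer is the straight line, sampled uniformly in time
expected = np.array([x0 + (k + 1) / 10 * (x1 - x0) for k in range(9)])
print(np.allclose(path, expected, atol=1e-3))
```

In the paper's setting the path lives on the probability simplex and the action uses the graph Wasserstein metric, so each gradient step would additionally involve the metric tensor and the simplex constraint.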

References

  1. Amari, S.: Neural learning in structured parameter spaces: natural Riemannian gradient. In: Mozer, M.C., Jordan, M.I., Petsche, T. (eds.) Advances in Neural Information Processing Systems 9, pp. 127–133. MIT Press, London (1997)

  2. Amari, S.: Natural gradient works efficiently in learning. Neural Comput. 10(2), 251–276 (1998)

  3. Amari, S.: Information Geometry and Its Applications. Applied Mathematical Sciences, vol. 194. Springer, Tokyo (2016)

  4. Amari, S., Karakida, R., Oizumi, M.: Information geometry connecting Wasserstein distance and Kullback–Leibler divergence via the entropy-relaxed transportation problem (2017). arXiv:1709.10219

  5. Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein GAN (2017). arXiv:1701.07875

  6. Ay, N., Jost, J., Lê, H., Schwachhöfer, L.: Information Geometry. Ergebnisse der Mathematik und ihrer Grenzgebiete, 3. Folge. Springer, Berlin (2017)

  7. Bakry, D., Émery, M.: Diffusions hypercontractives. In: Azéma, J., Yor, M. (eds.) Séminaire de Probabilités XIX 1983/84, pp. 177–206. Springer, Berlin (1985)

  8. Benamou, J.-D., Brenier, Y.: A computational fluid mechanics solution to the Monge–Kantorovich mass transfer problem. Numerische Mathematik 84(3), 375–393 (2000)

  9. Campbell, L.: An extended Čencov characterization of the information metric. Proc. Am. Math. Soc. 98, 135–141 (1986)

  10. Carlen, E.A., Gangbo, W.: Constrained steepest descent in the 2-Wasserstein metric. Ann. Math. 157(3), 807–846 (2003)

  11. Čencov, N.N.: Statistical Decision Rules and Optimal Inference. Translations of Mathematical Monographs, vol. 53. American Mathematical Society, Providence (1982). (Translation from the Russian edited by Lev J. Leifman)

  12. Chow, S.-N., Huang, W., Li, Y., Zhou, H.: Fokker–Planck equations for a free energy functional or Markov process on a graph. Arch. Ration. Mech. Anal. 203(3), 969–1008 (2012)

  13. Chow, S.-N., Li, W., Zhou, H.: A discrete Schrödinger equation via optimal transport on graphs (2017). arXiv:1705.07583

  14. Chow, S.-N., Li, W., Zhou, H.: Entropy dissipation of Fokker–Planck equations on graphs. Discrete Contin. Dyn. Syst. A 38(10), 4929–4950 (2018)

  15. Chung, F.R.K.: Spectral Graph Theory. CBMS Regional Conference Series in Mathematics, no. 92. American Mathematical Society, Providence (1997)

  16. Frogner, C., Zhang, C., Mobahi, H., Araya-Polo, M., Poggio, T.: Learning with a Wasserstein loss (2015). arXiv:1506.05439

  17. Gangbo, W., Li, W., Mou, C.: Geodesics of minimal length in the set of probability measures on graphs. ESAIM Control Optim. Calc. Var. (2018, to appear)

  18. Karakida, R., Amari, S.: Information geometry of Wasserstein divergence. In: Nielsen, F., Barbaresco, F. (eds.) Geometric Science of Information, pp. 119–126. Springer, Cham (2017)

  19. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization (2014). arXiv:1412.6980

  20. Lafferty, J.D.: The density manifold and configuration space quantization. Trans. Am. Math. Soc. 305(2), 699–741 (1988)

  21. Lebanon, G.: Axiomatic geometry of conditional models. IEEE Trans. Inf. Theory 51(4), 1283–1294 (2005)

  22. Li, W.: Geometry of probability simplex via optimal transport (2018). arXiv:1803.06360

  23. Li, W., Montufar, G.: Ricci curvature for parameter statistics via optimal transport (2018). arXiv:1807.07095

  24. Li, W., Yin, P., Osher, S.: Computations of optimal transport distance with Fisher information regularization. J. Sci. Comput. 75, 1581–1595 (2017)

  25. Lott, J.: Some geometric calculations on Wasserstein space. Commun. Math. Phys. 277(2), 423–437 (2007)

  26. Maas, J.: Gradient flows of the entropy for finite Markov chains. J. Funct. Anal. 261(8), 2250–2292 (2011)

  27. Malagò, L., Matteucci, M., Pistone, G.: Towards the geometry of estimation of distribution algorithms based on the exponential family. In: Proceedings of the 11th Workshop on Foundations of Genetic Algorithms, FOGA '11, pp. 230–242. ACM, New York (2011)

  28. Malagò, L., Pistone, G.: Natural gradient flow in the mixture geometry of a discrete exponential family. Entropy 17(12), 4215–4254 (2015)

  29. Mielke, A.: A gradient structure for reaction–diffusion systems and for energy-drift-diffusion systems. Nonlinearity 24(4), 1329–1346 (2011)

  30. Modin, K.: Geometry of matrix decompositions seen through optimal transport and information geometry. J. Geom. Mech. 9(3), 335–390 (2017)

  31. Montavon, G., Müller, K.-R., Cuturi, M.: Wasserstein training of restricted Boltzmann machines. In: Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R. (eds.) Advances in Neural Information Processing Systems 29, pp. 3718–3726. Curran Associates, Red Hook (2016)

  32. Montúfar, G., Rauh, J., Ay, N.: On the Fisher metric of conditional probability polytopes. Entropy 16(6), 3207–3233 (2014)

  33. Nelson, E.: Quantum Fluctuations. Princeton Series in Physics. Princeton University Press, Princeton (1985)

  34. Otto, F.: The geometry of dissipative evolution equations: the porous medium equation. Commun. Partial Differ. Equ. 26(1–2), 101–174 (2001)

  35. Pascanu, R., Bengio, Y.: Revisiting natural gradient for deep networks. In: International Conference on Learning Representations (2014)

  36. Peters, J., Vijayakumar, S., Schaal, S.: Natural actor-critic. In: Gama, J., Camacho, R., Brazdil, P.B., Jorge, A.M., Torgo, L. (eds.) Machine Learning: ECML 2005, pp. 280–291. Springer, Berlin (2005)

  37. Takatsu, A.: Wasserstein geometry of Gaussian measures. Osaka J. Math. 48(4), 1005–1026 (2011)

  38. Villani, C.: Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften, vol. 338. Springer, Berlin (2009)

  39. Wong, T.-K.: Logarithmic divergences from optimal transport and Rényi geometry (2017). arXiv:1712.03610

  40. Yi, S., Wierstra, D., Schaul, T., Schmidhuber, J.: Stochastic search using the natural gradient. In: Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pp. 1161–1168. ACM, New York (2009)


Acknowledgements

The authors would like to thank Prof. Luigi Malagò for his inspiring talk at UCLA in December 2017. This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (Grant agreement no 757983).

Author information

Corresponding author: Wuchen Li.

Appendix

In this appendix we review, at a formal level, the equivalence of the static and dynamical formulations of the \(L^2\)-Wasserstein metric. For more details see [38].

Consider the linear programming duality:

$$\begin{aligned} \begin{aligned}&\frac{1}{2}W(\rho ^0,\rho ^1)^2\\&\quad =\inf _{\pi \ge 0}\Big \{\int _{\Omega }\int _{\Omega }\frac{1}{2}d_{\Omega }(x,y)^2\pi (x,y)dxdy:\int _\Omega \pi dy=\rho ^0(x),~\int _\Omega \pi dx=\rho ^1(y)\Big \}\\&\quad =\sup _{\Phi ^1, \Phi ^0}\Big \{\int _{\Omega }\Phi ^1(y)\rho ^1(y)dy-\int _\Omega \Phi ^0(x)\rho ^0(x)dx:\Phi ^1(y)-\Phi ^0(x)\le \frac{1}{2}d_\Omega (x,y)^2\Big \}. \end{aligned} \end{aligned}$$
(21)

By standard considerations, the supremum in the last formula is attained when

$$\begin{aligned} \Phi ^1(y)=\inf _{x\in \Omega }~\Big \{\Phi ^0(x)+\frac{1}{2}d_\Omega (x,y)^2\Big \}. \end{aligned}$$
(22)

This means that \(\Phi ^1\), \(\Phi ^0\) are related to the viscosity solution of the Hamilton–Jacobi equation on \(\Omega \):

$$\begin{aligned} \frac{\partial \Phi (t,x)}{\partial t}+\frac{1}{2}g_x^\Omega (\nabla \Phi (t,x), \nabla \Phi (t,x))=0, \end{aligned}$$
(23)

with \(\Phi ^0(x)=\Phi (0,x)\), \(\Phi ^1(x)=\Phi (1,x)\). Hence (21) becomes

$$\begin{aligned}&\frac{1}{2}W(\rho ^0,\rho ^1)^2\\&\quad =\sup _{\Phi }\Big \{\int _{\Omega }\Phi ^1(x)\rho ^1(x)-\Phi ^0(x)\rho ^0(x)dx:\frac{\partial \Phi (t,x)}{\partial t}+\frac{1}{2}g_x^\Omega (\nabla \Phi (t,x), \nabla \Phi (t,x))=0 \Big \}. \end{aligned}$$
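As a quick sanity check of the Hopf–Lax relation between (22) and (23), an example of ours rather than the authors': take \(\Omega =\mathbb {R}^n\) with the Euclidean metric and linear initial data \(\Phi ^0(x)=a\cdot x\). Then

```latex
% Hopf–Lax semigroup with Euclidean ground metric and linear initial data
\Phi (t,y)=\inf _{x\in \mathbb {R}^n}\Big \{a\cdot x+\frac{|x-y|^2}{2t}\Big \}
          =a\cdot y-\frac{t}{2}|a|^2,
\qquad \text{with minimizer } x^{*}=y-ta.
```

Here \(\partial _t\Phi =-\frac{1}{2}|a|^2\) and \(\nabla \Phi =a\), so \(\partial _t\Phi +\frac{1}{2}|\nabla \Phi |^2=0\), which is (23) with \(g_x^\Omega \) the Euclidean inner product; evaluating at \(t=1\) recovers the endpoint relation between \(\Phi ^0\) and \(\Phi ^1\).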

By duality of the above formulas we can obtain the variational problem (1). In other words, introduce the density path \(\rho _t=\rho (t,x)\) as the dual variable of \(\Phi _t=\Phi (t,x)\); then

$$\begin{aligned} \begin{aligned}&\frac{1}{2}W(\rho ^0,\rho ^1)^2\\&\quad =\sup _{\Phi _t}\inf _{\rho _t}~\int _{\Omega }\Phi ^1\rho ^1-\Phi ^0\rho ^0dx-\int _0^1\int _{\Omega }\rho _t\Big [ \partial _t\Phi _t+\frac{1}{2}g_x^\Omega (\nabla \Phi _t, \nabla \Phi _t)\Big ] dx dt\\&\quad =\sup _{\Phi _t}\inf _{\rho _t}~\int _{\Omega }\Phi ^1\rho ^1-\Phi ^0\rho ^0dx-\int _0^1\int _{\Omega }\rho _t \partial _t\Phi _tdxdt- \int _0^1\int _{\Omega }\frac{1}{2}g_x^\Omega (\nabla \Phi _t, \nabla \Phi _t)\rho _tdx dt\\&\quad =\sup _{\Phi _t}\inf _{\rho _t}~\int _0^1\int _{\Omega }\Big [\partial _t\rho _t \Phi _t-g_x^\Omega (\nabla \Phi _t, \nabla \Phi _t)\rho _t\Big ]dx dt+\int _0^1\int _{\Omega }\frac{1}{2}g_x^\Omega (\nabla \Phi _t, \nabla \Phi _t)\rho _tdx dt \\&\quad =\inf _{\rho _t}\sup _{\Phi _t}~\int _0^1\int _{\Omega }\Phi _t\big (\partial _t\rho _t+\text {div}(\rho _t\nabla \Phi _t)\big ) dx dt+\int _0^1\int _{\Omega }\frac{1}{2}g_x^\Omega (\nabla \Phi _t, \nabla \Phi _t)\rho _tdx dt \\&\quad =\inf _{\rho _t}~\Big \{\int _0^1\int _{\Omega }\frac{1}{2}g_x^\Omega (\nabla \Phi _t, \nabla \Phi _t)\rho _tdx dt:\partial _t\rho _t+\text {div}(\rho _t\nabla \Phi _t)=0,~\rho _0=\rho ^0, ~\rho _1=\rho ^1\Big \}. \end{aligned} \end{aligned}$$

The third equality follows from integration by parts with respect to t, and the fourth from exchanging the infimum and the supremum and integrating by parts with respect to x.

In the above derivations, the relation between the Hopf–Lax formula (22) and the Hamilton–Jacobi equation (23) plays a key role in the equivalence of the static and dynamic formulations of the Wasserstein metric. This is also a consequence of the fact that the sample space \(\Omega \) is a length space, i.e.,

$$\begin{aligned} d_\Omega (x,y)^2=\inf _{\gamma (t)}\Big \{\int _0^1g_{\gamma (t)}^\Omega (\dot{\gamma }, \dot{\gamma })dt:\gamma (0)=x,~\gamma (1)=y\Big \}. \end{aligned}$$

However, in a discrete sample space I there is no continuous path \(\gamma (t)\in I\) connecting two distinct points. Thus the relation between (22) and (23) does not hold on I. This indicates that in discrete sample spaces the Wasserstein metric in Definition 1 can differ from the one defined by linear programming (5). See [12, 26] for related discussions.
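The discrete operators that replace \(\nabla \) and \(\text {div}\) can be made concrete on a small weighted graph. Below is a minimal sketch, with naming of our own choosing, of the graph gradient \(\nabla _G\), graph divergence \(\text {div}_G\), and a weighted Laplacian of the form \(L(p)=-\text {div}_G(p\nabla _G\cdot )\); the arithmetic mean \(\theta _{ij}=(p_i+p_j)/2\) is one common choice of edge density in the literature on transport on graphs [12, 26], used here purely for illustration.

```python
import numpy as np

# A weighted graph on three nodes with edges (0,1) and (1,2), unit weights.
edges = [(0, 1), (1, 2)]
n_nodes = 3

# Incidence matrix D: one row per edge; (D @ phi)[e] = phi_j - phi_i on edge (i, j).
D = np.zeros((len(edges), n_nodes))
for e, (i, j) in enumerate(edges):
    D[e, i], D[e, j] = -1.0, 1.0

def grad_G(phi):
    """Graph gradient: potential differences along each edge."""
    return D @ phi

def div_G(flux):
    """Graph divergence: negative adjoint of the graph gradient."""
    return -D.T @ flux

def edge_density(p):
    """One possible edge weighting: arithmetic mean of the endpoint masses."""
    return np.array([(p[i] + p[j]) / 2 for (i, j) in edges])

def weighted_laplacian(p):
    """L(p) = -div_G(p grad_G .) as a matrix: D^T diag(theta) D."""
    return D.T @ np.diag(edge_density(p)) @ D

p = np.array([0.5, 0.3, 0.2])    # an interior point of the probability simplex
phi = np.array([1.0, 0.0, 2.0])  # an arbitrary potential
L = weighted_laplacian(p)

print(np.allclose(L, L.T))                                          # symmetric
print(np.allclose(L @ np.ones(n_nodes), 0.0))                       # constants in the kernel
print(np.allclose(L @ phi, -div_G(edge_density(p) * grad_G(phi))))  # matrix and operator forms agree
```

In constructions of this kind, the pseudo-inverse of L(p) restricted to tangent vectors of the simplex furnishes a metric tensor analogous to the \(g^W\) discussed in the text.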

Notations

We use the following notations. Each entry lists the object for the continuous sample space \(\Omega \) and its counterpart for the discrete sample space I; a hyphen (-) marks a notion used on one side only.

Sample space: \(\Omega \) / I
Inner product: \(g^\Omega \) / \(g^I\)
Gradient: \(\nabla \) / \(\nabla _G\)
Divergence: \(\text {div}\) / \(\text {div}_G\)
Hessian in \(\Omega \): Hess / -
Potential function set: \(\mathcal {F}(\Omega )\) / \(\mathcal {F}(I)\)
Weighted Laplacian operator: \(-\nabla \cdot (\rho \nabla )\) / L(p)
Probability space: \(\mathcal {P}_+(\Omega )\) / \(\mathcal {P}_+(I)\)
Probability distribution: \(\rho \) / p
Tangent space: \(T_\rho \mathcal {P}_+(\Omega )\) / \(T_p\mathcal {P}_+(I)\)
Wasserstein metric tensor: \(g^W\) / \(g^W\)
Dual coordinates: \(\Phi (x)\) / \((\Phi _i)_{i=1}^n\)
Primal coordinates: \(\sigma (x)\) / \((\sigma _i)_{i=1}^n\)
First differential operator: \(\delta _\rho \) / \(\nabla _p\)
Second differential operator: \(\delta ^2_{\rho \rho }\) / -
Gradient operator: - / \(\nabla _W\)
Hessian operator: - / \(\text {Hess}_W\)
Levi–Civita connection: - / \(\nabla ^W_{\cdot }\cdot \)
Parameter space / probability model: \(\Theta \) / \(p(\Theta )\)
Inner product: \(g_\theta \) / \(g_{p(\theta )}\)
Tangent space: \(T_\theta \Theta \) / \(T_{p(\theta )}p(\Theta )\)
\(L^2\)-Wasserstein matrix: \(G(\theta )\) / -
\(L^2\)-Wasserstein distance: \(\text {Dist}\) / \(\text {Dist}\)
Second fundamental form: - / \(B(\cdot , \cdot )\)
Projection operator: - / H
Levi–Civita connection: - / \((\nabla ^W_\cdot \cdot )^{||}\)
Jacobi operator: \(J_\theta \) / -
First differential operator: \(\nabla _\theta \) / -
Gradient operator: \(\nabla _g\) / -
Hessian operator: \(\text {Hess}_g\) / -


Cite this article

Li, W., Montúfar, G. Natural gradient via optimal transport. Info. Geo. 1, 181–214 (2018). https://doi.org/10.1007/s41884-018-0015-3


Keywords

  • Optimal transport
  • Information geometry
  • Wasserstein statistical manifold
  • Displacement convexity
  • Machine learning