
Wasserstein information matrix

  • Research Paper
  • Information Geometry

Abstract

We study information matrices for statistical models induced by the \(L^2\)-Wasserstein metric. We call them Wasserstein information matrices (WIMs), which are analogs of classical Fisher information matrices. We introduce Wasserstein score functions and study covariance operators in statistical models. Using them, we establish Wasserstein–Cramér–Rao bounds for estimation and compare them with classical results. We next consider the asymptotic behaviors and efficiency of estimators. We derive the online asymptotic efficiency for the Wasserstein natural gradient. In addition, we establish a Poincaré efficiency for the Wasserstein natural gradient of maximum likelihood estimation. Several analytical examples of WIMs are presented, including location-scale families, independent families, and rectified linear unit (ReLU) generative models.


Data Availability Statement

Not Applicable.


Funding

W. Li is supported by AFOSR MURI FA9550-18-1-0502, AFOSR YIP award 2023, and NSF RTG: 2038080.

Author information


Corresponding author

Correspondence to Wuchen Li.

Ethics declarations

Conflict of interest

The authors, W. Li and J.X. Zhao, declare that they have no conflict of interest.

Additional information

Communicated by Nihat Ay.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A. Proofs in Sect. 2

1.1 A.1. WIMs and score functions in analytic examples

Proof of WIMs in Gaussian families

Since we have

$$\begin{aligned} \log p(x;\mu ,\sigma )=-\frac{(x-\mu )^2}{2\sigma ^2}-\log \sigma -\log \sqrt{2\pi }, \end{aligned}$$

taking derivative, we get

$$\begin{aligned} \begin{aligned} \frac{\partial }{\partial x}\log p(x;\mu ,\sigma )&=-\frac{x-\mu }{\sigma ^2},\\ \frac{\partial }{\partial \mu }\log p(x;\mu ,\sigma )&=\frac{x-\mu }{\sigma ^2},\quad \frac{\partial }{\partial \sigma }\log p(x;\mu ,\sigma )=\frac{(x-\mu )^2}{\sigma ^3}-\frac{1}{\sigma }. \end{aligned} \end{aligned}$$

In this case, the Poisson equation for the Wasserstein score functions \((\Phi ^W_\mu , \Phi ^W_\sigma )\) reads

$$\begin{aligned} \left\{ \begin{aligned}&-\frac{x-\mu }{\sigma ^2}\cdot \frac{\partial }{\partial x}\Phi ^W_\mu +\frac{\partial ^2}{\partial x^2}\Phi ^W_\mu =-\frac{x-\mu }{\sigma ^2}, \\&-\frac{x-\mu }{\sigma ^2}\cdot \frac{\partial }{\partial x}\Phi ^W_\sigma +\frac{\partial ^2}{\partial x^2}\Phi ^W_\sigma =-\frac{(x-\mu )^2}{\sigma ^3}+\frac{1}{\sigma }. \end{aligned}\right. \end{aligned}$$

We can directly check that \(\Phi ^W_\mu (x; \mu , \sigma ) = x - \mu \) and \(\Phi ^W_\sigma (x; \mu , \sigma )=\frac{(x-\mu )^2 - \sigma ^2}{2\sigma }\) are solutions, and that they also satisfy the normalization condition \(\mathbb {E}_{p_\theta } \Phi _i^W = 0\). Thus

$$\begin{aligned} \begin{aligned} G_W(\mu , \sigma )_{\mu \mu }=&\ \mathbb {E}_{p_{\mu ,\sigma }} \left( \frac{\partial }{\partial x}\Phi ^W_\mu , \frac{\partial }{\partial x}\Phi ^W_\mu \right) =\mathbb {E}_{p_{\mu ,\sigma }} 1 =1,\\ G_W(\mu , \sigma )_{\mu \sigma }=&\ \mathbb {E}_{p_{\mu ,\sigma }} \left( \frac{\partial }{\partial x}\Phi ^W_\mu , \frac{\partial }{\partial x}\Phi ^W_\sigma \right) = \mathbb {E}_{p_{\mu ,\sigma }} \left( 1\cdot \frac{X-\mu }{\sigma }\right) = 0, \\ G_W(\mu , \sigma )_{\sigma \sigma }=&\ \mathbb {E}_{p_{\mu ,\sigma }} \left( \frac{\partial }{\partial x}\Phi ^W_\sigma , \frac{\partial }{\partial x}\Phi ^W_\sigma \right) = \mathbb {E}_{p_{\mu ,\sigma }} \left( \frac{X-\mu }{\sigma }\cdot \frac{X-\mu }{\sigma } \right) = 1. \end{aligned} \end{aligned}$$
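As an illustration (not part of the original proof), the identity \(G_W(\mu ,\sigma ) = I_2\) can be checked numerically by Monte Carlo, using the score gradients \(\frac{\partial }{\partial x}\Phi ^W_\mu = 1\) and \(\frac{\partial }{\partial x}\Phi ^W_\sigma = \frac{x-\mu }{\sigma }\) derived above; the parameter values and sample size below are arbitrary illustrative choices:

import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.3, 0.7
x = rng.normal(mu, sigma, size=1_000_000)

# gradients of the Wasserstein score functions derived above
d_phi_mu = np.ones_like(x)
d_phi_sigma = (x - mu) / sigma

# Monte Carlo estimate of the WIM; it should be close to the 2x2 identity
G = np.array([[np.mean(d_phi_mu * d_phi_mu), np.mean(d_phi_mu * d_phi_sigma)],
              [np.mean(d_phi_sigma * d_phi_mu), np.mean(d_phi_sigma * d_phi_sigma)]])
print(G)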

Proof of WIMs in exponential families

We derive results using the closed-form solution in 1-d. The cumulative distribution function satisfies

$$\begin{aligned} F(x;m,\lambda )={\left\{ \begin{array}{ll} 1-e^{-\lambda (x - m)}&{} x\ge m,\\ 0 &{} x<m. \end{array}\right. } \end{aligned}$$

Thus

$$\begin{aligned} \begin{aligned} \frac{\partial }{\partial \lambda }F(x;m,\lambda )={\left\{ \begin{array}{ll} (x - m) e^{-\lambda (x - m)}&{}{x\ge m,}\\ 0 &{} x< m. \end{array}\right. }\\ \frac{\partial }{\partial m}F(x;m,\lambda )={\left\{ \begin{array}{ll} -\lambda e^{-\lambda (x - m)}&{}{x\ge m,}\\ 0 &{} x < m. \end{array}\right. } \end{aligned} \end{aligned}$$

Then

$$\begin{aligned} \begin{aligned} \Phi ^W_\lambda (x;m,\lambda ) =&-\int _m^{x}\frac{1}{p(y;m,\lambda )}\frac{\partial }{\partial \lambda }F(y;m,\lambda )dy + C_1 \\ =&-\int _m^x\frac{(y - m)}{\lambda }dy + C_1 = -\frac{(x - m)^2}{2\lambda } + C_1, \\ \Phi ^W_m(x;m,\lambda ) =&-\int _m^{x}\frac{1}{p(y;m,\lambda )}\frac{\partial }{\partial m}F(y;m,\lambda )dy + C_2 \\ =&\int _m^x dy + C_2 = (x - m) + C_2. \\ \end{aligned} \end{aligned}$$

Using the normalization condition, we can determine the integration constants appearing above. The inner products between score functions then follow as

$$\begin{aligned} \begin{aligned} G_W(m,\lambda )_{\lambda \lambda } =&\ \mathbb {E}_{p_{m,\lambda }} \left( \frac{\partial }{\partial x}\Phi ^W_\lambda , \frac{\partial }{\partial x}\Phi ^W_\lambda \right) \\ =&\int _m^{\infty } \frac{(x - m)}{\lambda }\cdot \frac{(x - m)}{\lambda }\cdot \lambda e^{-\lambda \left( x - m \right) }dx \\ =&\ \int _m^{\infty } \frac{(x - m)^2}{\lambda }e^{-\lambda \left( x - m \right) }dx = \frac{2}{\lambda ^4}, \\ G_W(m,\lambda )_{\lambda m} =&\ \mathbb {E}_{p_{m,\lambda }} \left( \frac{\partial }{\partial x}\Phi ^W_\lambda , \frac{\partial }{\partial x}\Phi ^W_m \right) = -\int _m^{\infty } \frac{(x - m)}{\lambda }\cdot \lambda e^{-\lambda \left( x - m \right) }dx = -\frac{1}{\lambda ^2}, \\ G_W(m,\lambda )_{mm} =&\ \mathbb {E}_{p_{m,\lambda }} \left( \frac{\partial }{\partial x}\Phi ^W_m, \frac{\partial }{\partial x}\Phi ^W_m \right) = \int _m^{\infty } \lambda e^{-\lambda \left( x - m \right) }dx = 1. \end{aligned} \end{aligned}$$
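The same entries can be verified numerically from the closed-form relation \(\frac{\partial }{\partial x}\Phi ^W_\theta = -\frac{1}{p}\frac{\partial }{\partial \theta }F\), approximating the parameter derivatives of \(F\) by finite differences (an illustrative sketch, not part of the original proof; the parameter values are arbitrary):

import numpy as np

rng = np.random.default_rng(1)
m, lam, eps = 0.5, 2.0, 1e-6
x = m + rng.exponential(1.0 / lam, size=1_000_000)

def F(x, m, lam):
    # cumulative distribution function of the shifted exponential distribution
    return np.where(x >= m, 1.0 - np.exp(-lam * (x - m)), 0.0)

p = lam * np.exp(-lam * (x - m))
d_phi_m = -(F(x, m + eps, lam) - F(x, m - eps, lam)) / (2 * eps) / p
d_phi_lam = -(F(x, m, lam + eps) - F(x, m, lam - eps)) / (2 * eps) / p

# Monte Carlo estimate; approximately [[2/lam**4, -1/lam**2], [-1/lam**2, 1]]
G = np.array([[np.mean(d_phi_lam * d_phi_lam), np.mean(d_phi_lam * d_phi_m)],
              [np.mean(d_phi_m * d_phi_lam), np.mean(d_phi_m * d_phi_m)]])
print(G)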

Proof of WIMs in uniform families

The cumulative distribution function satisfies

$$\begin{aligned} F(x;a,b)={\left\{ \begin{array}{ll} 1&{} x> b,\\ \frac{x-a}{b-a} &{} a\le x \le b,\\ 0 &{} x<a. \end{array}\right. } \end{aligned}$$

Thus when \(x\in [a,b]\),

$$\begin{aligned} \frac{\partial }{\partial a}F(x;a,b)=\frac{x-b}{(b-a)^2},\quad \frac{\partial }{\partial b}F(x;a,b)=\frac{a-x}{(b-a)^2}. \end{aligned}$$

Then

$$\begin{aligned} \begin{aligned} \Phi ^W_a(x;a,b)=&-\int _a^{x}\frac{1}{p(y;a,b)}\frac{\partial }{\partial a}F(y;a,b)dy + C_1 = \frac{(x-a)(2b-a-x)}{2(b-a)} + C_1,\\ \Phi ^W_b(x;a,b) =&-\int _a^{x}\frac{1}{p(y;a,b)}\frac{\partial }{\partial b}F(y;a,b)dy + C_2 = \frac{(x-a)^2}{2(b-a)} + C_2, \end{aligned} \end{aligned}$$

where integration constants \(C_1,C_2\) can be decided via the normalization condition. Thus

$$\begin{aligned} \begin{aligned} G_W(a,b)_{aa} =&\ \mathbb {E}_{p_{a, b}} \left( \frac{\partial }{\partial x}\Phi ^W_a, \frac{\partial }{\partial x}\Phi ^W_a \right) = \frac{1}{3},\\ G_W(a,b)_{ab} =&\ \mathbb {E}_{p_{a, b}} \left( \frac{\partial }{\partial x}\Phi ^W_a, \frac{\partial }{\partial x}\Phi ^W_b \right) = \frac{1}{6},\\ G_W(a,b)_{bb} =&\ \mathbb {E}_{p_{a, b}} \left( \frac{\partial }{\partial x}\Phi ^W_b, \frac{\partial }{\partial x}\Phi ^W_b \right) = \frac{1}{3}. \end{aligned} \end{aligned}$$
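For instance, the off-diagonal entry follows from the elementary integral below (a worked step added here for completeness, using \(\frac{\partial }{\partial x}\Phi ^W_a = \frac{b-x}{b-a}\) and \(\frac{\partial }{\partial x}\Phi ^W_b = \frac{x-a}{b-a}\)):

$$\begin{aligned} G_W(a,b)_{ab} = \frac{1}{b-a}\int _a^b \frac{b-x}{b-a}\cdot \frac{x-a}{b-a}\,dx = \frac{1}{(b-a)^3}\cdot \frac{(b-a)^3}{6} = \frac{1}{6}. \end{aligned}$$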

Proof of the WIM in semicircle families

The cumulative distribution function satisfies

$$\begin{aligned} \begin{aligned} F(x + m;m,R)=&\ \int _{-R}^x \frac{2}{\pi R^2}\sqrt{R^2-y^2}dy\\ =&\ \int _{-\frac{\pi }{2}}^{\arcsin (\frac{x}{R})}\frac{2}{\pi R^2}\sqrt{R^2-R^2\sin ^2t}d (R\sin t)\\=&\ \int _{-\frac{\pi }{2}}^{\arcsin (\frac{x}{R})}\frac{2}{\pi R^2} R^2(\cos t)^2 dt\\ =&\ \int _{-\frac{\pi }{2}}^{\arcsin (\frac{x}{R})}\frac{1}{\pi }\left( \cos (2t)+1\right) dt\\ =&\ \frac{1}{\pi }\Big (\frac{\sin (2t)}{2}+t\Big )\Bigg |_{-\frac{\pi }{2}}^{\arcsin \frac{x}{R}}\\ =&\ \frac{1}{\pi }\Big \{\frac{x\sqrt{R^2-x^2}}{R^2}+\arcsin \frac{x}{R}+\frac{\pi }{2}\Big \}, \end{aligned} \end{aligned}$$

where we use a transformation \(y=R\sin t\). Thus

$$\begin{aligned} \begin{aligned} \frac{\partial }{\partial R}F(x + m;m,R)=&\ \frac{1}{\pi }\Big \{\frac{(x\sqrt{R^2-x^2})'R^2-2Rx\sqrt{R^2-x^2}}{R^4}+(\arcsin \frac{x}{R}) '\Big \}\\ =&\ \frac{1}{\pi }\Big \{ \frac{xR(R^2-x^2)^{-\frac{1}{2}}R^2-2Rx\sqrt{R^2-x^2}}{R^4} -\frac{x}{R\sqrt{R^2-x^2}}\Big \}\\ =&\ -\frac{2x\sqrt{R^2-x^2}}{\pi R^3}. \end{aligned} \end{aligned}$$

Thus

$$\begin{aligned} \begin{aligned} \Phi ^W_R(x + m;m,R)=&\ -\int _{-R}^{x}\frac{1}{p(y;m,R)}\frac{\partial }{\partial R}F(y;m,R)dy + C \\ =&\ \int _{-R}^x\frac{y}{R}dy + C \\ =&\ \frac{1}{R}(\frac{x^2}{2}-\frac{R^2}{2}) + C. \end{aligned} \end{aligned}$$

The calculation of the score function associated with the location parameter \(m\) is the same as before. And we conclude

$$\begin{aligned} \Phi ^W_m(x;m,R)= x - m. \end{aligned}$$

Thus

$$\begin{aligned} \begin{aligned} G_W(m,R)_{mm} =&\ \mathbb {E}_{p_{m,R}} \left( \frac{\partial }{\partial x}\Phi ^W_m, \frac{\partial }{\partial x}\Phi ^W_m \right) = 1, \\ G_W(m,R)_{mR} =&\ \mathbb {E}_{p_{m,R}} \left( \frac{\partial }{\partial x}\Phi ^W_R, \frac{\partial }{\partial x}\Phi ^W_m \right) = \frac{1}{R}\mathbb {E}_{p_{m,R}} \left( x - m \right) = 0, \\ G_W(m,R)_{RR} =&\ \mathbb {E}_{p_{m,R}} \left( \frac{\partial }{\partial x}\Phi ^W_R, \frac{\partial }{\partial x}\Phi ^W_R \right) = \frac{1}{R^2}\mathbb {E}_{p_{m,R}} \left( x - m \right) ^2 = \frac{1}{4}. \end{aligned} \end{aligned}$$
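For completeness (a short added step), the last entry uses the second moment of the semicircle law,

$$\begin{aligned} \mathbb {E}_{p_{m,R}} \left( x - m \right) ^2 = \int _{-R}^{R} y^2\,\frac{2}{\pi R^2}\sqrt{R^2-y^2}\,dy = \frac{R^2}{4}, \end{aligned}$$

so that \(G_W(m,R)_{RR} = \frac{1}{R^2}\cdot \frac{R^2}{4} = \frac{1}{4}\).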

1.2 A.2. The location-scale family

Example 13

(Location-scale families) Consider a location-scale family as follows. Given a probability density function \(p(x)\) with \(\int _{\mathbb {R}}p(x)dx = 1\), we define density functions of a location-scale family with a location parameter \(m\), and a scale parameter \(\lambda \) as

$$\begin{aligned} p(x;m,\lambda ) = \frac{1}{\lambda }p\left( \frac{x - m}{\lambda }\right) ,\qquad \lambda > 0. \end{aligned}$$

Most of previously discussed examples belong to this family, except that we do not use location and scale parameters in their parameterizations. We present some geometric formulas in this setting. We further require the original density function to be symmetric according to the location parameter \(m\), i.e. \(p(x)=p(2m-x)\). Notice that a simple corollary of this assumption is \(\mathbb {E}_{p_{m,\lambda }}x=m\).

We use the closed-form solution for the 1-d model to calculate the score function associated with the location parameter \(m\). We have

$$\begin{aligned} \begin{aligned} \frac{\partial }{\partial m}F(x;m,\lambda ) =&\frac{\partial }{\partial m} \int _{-\infty }^x p(y;m,\lambda )dy = \frac{\partial }{\partial m} \int _{-\infty }^x \frac{1}{\lambda }p\left( \frac{y - m}{\lambda }\right) dy \\ =&- \frac{\partial }{\partial x} \int _{-\infty }^x \frac{1}{\lambda }p\left( \frac{y - m}{\lambda }\right) dy = -p(x;m,\lambda ). \end{aligned} \end{aligned}$$

Consequently, the score function associated to the parameter \(m\) satisfies

$$\begin{aligned} \Phi ^W_m(x;m,\lambda )=-\int _m^{x}\frac{1}{p(y;m,\lambda )}\frac{\partial }{\partial m}F(y;m,\lambda )dy + C_1 = \left( x - m \right) + C_1, \end{aligned}$$

where the integration constant \(C_1\) is determined to be 0. Thus we have

$$\begin{aligned} G_W(m,\lambda )_{mm} = \mathbb {E}_{p_{m,\lambda }} \left( \frac{\partial }{\partial x}\Phi ^W_m, \frac{\partial }{\partial x}\Phi ^W_m \right) = 1. \end{aligned}$$

For the scale parameter \(\lambda \), we use the method of optimal transport maps to determine its score function. Namely, for two smooth densities \(p_1, p_2\) which are absolutely continuous w.r.t. each other, the Wasserstein distance between them is attained by an optimal transport map \(f\), i.e.

$$\begin{aligned} f_* p_1 = p_2, \quad W_2^2\left( p_1, p_2 \right) = \int _{\mathcal {X}} \left( f\left( x \right) - x \right) ^2 p_1\left( x \right) dx. \end{aligned}$$

Assume we have a tangent vector \(\frac{\partial p}{\partial \theta }\) and a smooth path \(p\left( t \right) \subset \mathcal {P}\left( \mathcal {X}\right) , t\in \left[ -\epsilon , \epsilon \right] \) with \(p\left( 0 \right) = p_0\), \(p'\left( 0 \right) = \frac{\partial p}{\partial \theta }\). Denote the optimal transport map between \(p\left( 0 \right) \) and \(p\left( \Delta \theta \right) \) as \(f\left( x, \Delta \theta \right) \). Then we have the following relation between optimal transport maps and the score function associated with the tangent vector \(\frac{\partial p}{\partial \theta }\):

$$\begin{aligned} \frac{\partial }{\partial x}\Phi ^W\left( x \right) = \lim _{\Delta \theta \rightarrow 0}\frac{f\left( x, \Delta \theta \right) - x}{\Delta \theta }. \end{aligned}$$

First, we show that the optimal transportation map between distributions \(p(x;m_1,\lambda _1)\) and \(p(x;m_2,\lambda _2)\) is given by a linear map

$$\begin{aligned} l(x) = m_2+\frac{(x-m_1)\lambda _2}{\lambda _1}. \end{aligned}$$

As we are working in a location-scale family, it is easy to show that this map pushes \(p(x;m_1,\lambda _1)\) forward to \(p(x;m_2,\lambda _2)\), i.e. \(l_*p_{m_1,\lambda _1} = p_{m_2,\lambda _2}\). Then, we have

$$\begin{aligned} l(x) = \frac{\partial }{\partial x} \left( m_2\left( x - m_1 \right) + \frac{(x-m_1)^2\lambda _2}{2\lambda _1} \right) . \end{aligned}$$

The function in the bracket is a convex function. Therefore, \(l\left( x \right) \) is exactly the optimal transportation map between these two distributions.

To calculate the score function corresponding to the tangent vector \(\frac{\partial }{\partial \lambda }\), we consider the following infinitesimal optimal transport \(p(x;m_1,\lambda _1) \rightarrow p(x;m_1,\lambda _1 + d\lambda )\). By the discussion above, the optimal transport map is given by

$$\begin{aligned} l\left( x \right) = m_1+\frac{(x-m_1)\left( \lambda _1 + d\lambda \right) }{\lambda _1} = x + \left( x - m_1 \right) \frac{d\lambda }{\lambda _1}. \end{aligned}$$

Thus the gradient of the score function is given by

$$\begin{aligned} \frac{\partial }{\partial x}\Phi ^W_{\lambda }\left( x; m_1,\lambda _1 \right) = \frac{l\left( x \right) - x}{d\lambda } = \frac{(x-m_1)}{\lambda _1}. \end{aligned}$$

The inner product of this tangent vector is given by

$$\begin{aligned} \begin{aligned} G_W(m,\lambda )_{\lambda \lambda } =&\mathbb {E}_{p_{m,\lambda }} \left( \frac{\partial }{\partial x}\Phi ^W_{\lambda }, \frac{\partial }{\partial x}\Phi ^W_{\lambda } \right) \\ =&\int _{\mathbb {R}} \left( \frac{x - m}{\lambda }\right) ^2 p(x;m,\lambda )dx \\ =&\frac{\mathbb {E}_{p_{m,\lambda }}x^2-2m\mathbb {E}_{p_{m,\lambda }}x+m^2}{\lambda ^2}. \end{aligned} \end{aligned}$$

The gradient of the score function associated with the parameter \(\lambda \) (resp. \(m\)) is an odd (resp. even) function when viewed as a function of \(x-m\). Since \(p(x;m,\lambda )\) is symmetric about \(m\), the integral of their product vanishes, i.e.

$$\begin{aligned} G_W(m,\lambda )_{\lambda m} = \mathbb {E}_{p_{m,\lambda }} \left( \frac{\partial }{\partial x}\Phi ^W_{\lambda }, \frac{\partial }{\partial x}\Phi ^W_m \right) = \frac{1}{\lambda }\mathbb {E}_{p_{m,\lambda }}(x - m) = 0. \end{aligned}$$

Consequently, under the symmetry assumption, WIMs of location-scale families are diagonal matrices, i.e.

$$\begin{aligned} G_W\left( m, \lambda \right) = \begin{pmatrix} 1 &{} 0 \\ 0 &{} \frac{\mathbb {E}_{p_{m,\lambda }}x^2-2m\mathbb {E}_{p_{m,\lambda }}x+m^2}{\lambda ^2} \end{pmatrix}. \end{aligned}$$
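Since \(\mathbb {E}_{p_{m,\lambda }}x = m\) under the symmetry assumption, the \((\lambda ,\lambda )\) entry can be rewritten (an added reformulation for comparison) as

$$\begin{aligned} G_W(m,\lambda )_{\lambda \lambda } = \frac{\mathrm {Var}_{p_{m,\lambda }}(x)}{\lambda ^2} = \mathrm {Var}_{p}\left( \frac{x-m}{\lambda }\right) , \end{aligned}$$

i.e. the variance of the base density; for the standard Gaussian base this recovers \(G_W(\mu ,\sigma )_{\sigma \sigma } = 1\) from Appendix A.1.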

We next explain the above closed-form solutions of WIMs by the following proposition.

Proposition 16

A location-scale family \(p(x;m,\lambda )\) is a totally geodesic family in the density manifold under the Wasserstein metric.

Proof

It suffices to prove that for any two densities \(\rho _1 =p(x;m_1,\lambda _1)\) and \(\rho _2 = p(x;m_2,\lambda _2)\), a geodesic connecting them lies within this family. We compute the optimal transport map \(T\) associated with these two measures \(\rho _1,\rho _2\), that is

$$\begin{aligned} T = {{\,\mathrm{\textrm{argmin}}\,}}_{T_* \rho _1 = \rho _2} \int _{\mathbb {R}} \left( T(x) - x \right) ^2 \rho _1(x)dx, \end{aligned}$$

where \(T\) is a map that pushes density \(\rho _1\) forward to density \(\rho _2\). It is known that a necessary and sufficient condition for a map to be optimal in the 1-d case is that it is monotone, i.e. \(\left( T(x) - T(y) \right) \left( x - y \right) \ge 0\). In a location-scale family, such a map has a closed-form expression, namely

$$\begin{aligned} T(x) = \frac{\lambda _2\left( x - m_1 \right) }{\lambda _1} + m_2. \end{aligned}$$

The geodesic \(\gamma (t):[0,1] \rightarrow \mathcal {P}(\mathbb {R})\) between \(\rho _1\) and \(\rho _2\) then follows from the classical theory of optimal transport as

$$\begin{aligned} \gamma (t) = \left( tx + (1-t)T(x) \right) _*\rho _1, \end{aligned}$$

where the push-forward map has a closed-form solution

$$\begin{aligned} \begin{aligned} tx + (1-t)T(x) =&\ tx + (1 - t)\frac{\lambda _2\left( x - m_1 \right) }{\lambda _1} + (1 - t)m_2 \\ =&\ \frac{\left( t\lambda _1 + (1 - t)\lambda _2 \right) \left( x - m_1 \right) }{\lambda _1} + (1 - t)m_2 + tm_1. \end{aligned} \end{aligned}$$

And by the same argument, \(\gamma (t)\) lies in this location-scale family with parameters given by

$$\begin{aligned} \lambda _t = t\lambda _1 + (1 - t)\lambda _2,\qquad m_t = (1 - t)m_2 + tm_1. \end{aligned}$$

Thus we have shown that geodesics between any two densities in a location-scale family lie in this family. In other words, location-scale families are totally geodesic submanifolds of the density manifold. \(\square \)

Remark 11

This result on the totally geodesic property of location-scale families generalizes the corresponding result for Gaussian families in 1-d. Both proofs rely on the fact that the optimal transport maps within these families are linear.

Remark 12

For location-scale families, we also write down the Fisher score functions and the Fisher information matrix for comparison:

$$\begin{aligned} \begin{aligned} \Phi _{m}^F(x;m,\lambda )&= -\frac{p'}{\lambda p},\quad \Phi _{\lambda }^F(x;m,\lambda ) = -\frac{1}{\lambda } - \frac{(x - m)p'}{\lambda ^2 p}, \\ G_F(m,\lambda )_{\lambda \lambda }&= \int _{\mathbb {R}} p \left( \frac{\partial }{\partial \lambda } \log p \right) ^2 dx = \int _{\mathbb {R}} p \left( -\frac{1}{\lambda } - \frac{(x - m)p'}{\lambda ^2 p}\right) ^2 dx \\&= \frac{1}{\lambda ^2} \left( 1 + \int _{\mathbb {R}} \left( \frac{\left( x - m \right) ^2 p'^2}{\lambda ^2 p} + \frac{2\left( x - m \right) p'}{\lambda }\right) dx \right) , \\ G_F(m,\lambda )_{mm}&= \int _{\mathbb {R}} p \left( \frac{\partial }{\partial m} \log p \right) ^2 dx = \frac{1}{\lambda ^2}\int _{\mathbb {R}} \frac{p'^2}{p} dx, \\ G_F(m,\lambda )_{m\lambda }&= \int _{\mathbb {R}} p \left( \frac{\partial }{\partial m} \log p \right) \left( \frac{\partial }{\partial \lambda } \log p \right) dx \\&= \int _{\mathbb {R}} p \left( -\frac{p'}{\lambda p} \right) \left( -\frac{1}{\lambda } - \frac{(x - m)p'}{\lambda ^2p} \right) dx \\&= \int _{\mathbb {R}} \frac{(x - m)p'^2}{\lambda ^3p} dx. \end{aligned} \end{aligned}$$
(15)

Above, we use \(p'\) to denote \(\frac{\partial }{\partial x} p\) for simplicity. The WIM and the Fisher information matrix of a location-scale family are thus given by

$$\begin{aligned} \begin{aligned} G_W\left( m, \lambda \right)&= \begin{pmatrix} 1 &{} 0 \\ 0 &{} \frac{\mathbb {E}_{p_{m,\lambda }}x^2-2m\mathbb {E}_{p_{m,\lambda }}x+m^2}{\lambda ^2} \end{pmatrix}, \\ G_F\left( m, \lambda \right)&= \frac{1}{\lambda ^2}\begin{pmatrix} \int _{\mathbb {R}} \frac{p'^2}{p} dx &{} \int _{\mathbb {R}} \frac{(x - m)p'^2}{\lambda p} dx \\ \int _{\mathbb {R}} \frac{(x - m)p'^2}{\lambda p} dx &{} 1 + \int _{\mathbb {R}} \left( \frac{\left( x - m \right) ^2 p'^2}{\lambda ^2 p} + \frac{2\left( x - m \right) p'}{\lambda }\right) dx \end{pmatrix}, \end{aligned} \end{aligned}$$

which illustrates that WIMs are simpler than Fisher information matrices in location-scale families.

Appendix B. Functional inequalities via information matrices

In this section, we explore connections between information matrices and functional inequalities such as log-Sobolev inequalities (LSIs) and Poincaré inequalities (PIs) in statistical models. In Sect. 4, we show that these inequalities are important for the study of statistical efficiency properties.

1.1 B.1. Classical functional inequalities

Before working in statistical models, we first summarize the relations among PIs, LSIs and dynamical quantities on the density manifold.

Consider the relative entropy (KL-divergence) defined on density manifold as

$$\begin{aligned} \textrm{D}_{\textrm{KL}}(\mu \Vert \nu ) = \int _{\mathcal {X}} \log \frac{\mu (x)}{\nu (x)} \mu (x)dx, \qquad \mu \in \mathcal {P}(\mathcal {X}). \end{aligned}$$

We recall the definition of log-Sobolev inequality as below.

Definition 17

(Log-Sobolev inequality) A probability measure \(\nu \) is said to satisfy a log-Sobolev inequality with constant \(\alpha > 0\) (in short: LSI\((\alpha )\)) if we have

$$\begin{aligned} \textrm{D}_{\textrm{KL}}(\mu \Vert \nu ) < \frac{1}{2\alpha }I(\mu \Vert \nu ), \qquad \mu \in \mathcal {P}(\mathcal {X}), \end{aligned}$$

where the quantity \(I(\mu \Vert \nu )\) is the so-called Fisher-information functional

$$\begin{aligned} I(\mu \Vert \nu ) = \int _{\mathcal {X}} \left| \nabla _x \log \frac{\mu (x)}{\nu (x)} \right| ^2 \mu (x) dx, \qquad \mu \in \mathcal {P}(\mathcal {X}). \end{aligned}$$

Remark 13

If we assume that \(\mu \) is absolutely continuous w.r.t. the reference measure \(\nu \) and define function \(h\) on \(\mathcal {X}\) as

$$\begin{aligned} \mu (x) = \frac{h(x)\nu (x)}{\int _{\mathcal {X}} h(x) \nu (x)dx}, \end{aligned}$$

then the above definition of LSI translates to

$$\begin{aligned} \begin{aligned}&\textrm{D}_{\textrm{KL}}(\mu \Vert \nu ) \int _{\mathcal {X}}h(x)\nu (x) dx \\&\quad = \int _{\mathcal {X}}h(x) \log h(x) \nu (x) dx - \left( \int _{\mathcal {X}}h(x) \nu (x) dx \right) \log \left( \int _{\mathcal {X}}h(x) \nu (x) dx \right) \\&\quad \le \frac{1}{2\alpha }\int _{\mathcal {X}} \frac{\left| \nabla _x h(x) \right| ^2}{h(x)} \nu (x) dx = \frac{1}{2\alpha }\left( I(\mu \Vert \nu ) \int _{\mathcal {X}}h(x) \nu (x) dx \right) . \end{aligned} \end{aligned}$$

The middle inequality is a more familiar definition of LSI\(\left( \alpha \right) \). By linearizing the above formula with \(h = 1 + \epsilon f, \epsilon \rightarrow 0\), we obtain the classical definition of PI\(\left( \alpha \right) \)

$$\begin{aligned} \int _{\mathcal {X}}f^2(x) \nu (x) dx \le \frac{1}{\alpha }\int _{\mathcal {X}} \left| \nabla _x f(x) \right| ^2 \nu (x) dx, \qquad \int _{\mathcal {X}}f(x) \nu (x) dx = 0. \end{aligned}$$

Definition 18

(Poincaré inequality) A probability measure \(\nu \) is said to satisfy a Poincaré inequality with constant \(\alpha > 0\) (in short: PI\((\alpha )\)) if we have

$$\begin{aligned} \int _{\mathcal {X}}f^2(x) \nu (x) dx \le \frac{1}{\alpha }\int _{\mathcal {X}} \left| \nabla _x f(x) \right| ^2 \nu (x) dx, \qquad \forall f, \ s.t. \int _{\mathcal {X}}f(x) \nu (x) dx = 0. \end{aligned}$$

A sufficient criterion that guarantees LSIs and PIs is related to the information matrices (operators, or metrics in the infinite-dimensional case) \(G_W\).

Proposition 19

Denote by \({{\,\mathrm{\textrm{Hess}}\,}}_{W} \textrm{D}_{\textrm{KL}}(\mu \Vert \nu )\) and \(G_W(\mu )\) the two bilinear forms corresponding to the Hessian of the relative entropy and to the Wasserstein metric.

  1. Suppose \({{\,\mathrm{\textrm{Hess}}\,}}_{W} \text {D}_{\text {KL}}(\mu \Vert \nu ) - 2\alpha G_W(\mu )\) is a positive semi-definite bilinear form on the Hilbert space \(T_{\mu }\mathcal {P}\left( \mathcal {X}\right) \) for all \(\mu \in \mathcal {P}\left( \mathcal {X}\right) \). Then LSI\(\left( \alpha \right) \) holds for \(\nu \).

  2. Suppose \({{\,\mathrm{\textrm{Hess}}\,}}_{W} \text {D}_{\text {KL}}(\nu \Vert \nu ) - 2\alpha G_W(\nu )\) is a positive semi-definite bilinear form on the Hilbert space \(T_{\nu }\mathcal {P}\left( \mathcal {X}\right) \). Then PI\(\left( \alpha \right) \) holds for \(\nu \).

Proof

First, we prove the result concerning LSIs. We compute the gradient of the relative entropy w.r.t. the Wasserstein metric, which is given by

$$\begin{aligned} \begin{aligned} {{\,\mathrm{\textrm{grad}}\,}}_W \textrm{D}_{\textrm{KL}}(\mu \Vert \nu ) =&- \nabla \cdot \left( \mu \nabla \frac{\delta }{\delta \mu }\textrm{D}_{\textrm{KL}}(\mu \Vert \nu ) \right) = - \nabla \cdot \left( \mu \nabla \log \frac{\mu (x)}{\nu (x)} \right) , \end{aligned} \end{aligned}$$

where \(\frac{\delta }{\delta \mu }\) refers to the \(L^2\) functional derivative. Thus it is easy to obtain the relative entropy dissipation along the gradient flow as

$$\begin{aligned} \begin{aligned}&\ \frac{d}{dt} \textrm{D}_{\textrm{KL}}(\mu \Vert \nu ) \\&\quad = \ - g_W \left( {{\,\mathrm{\textrm{grad}}\,}}_W \textrm{D}_{\textrm{KL}}(\mu \Vert \nu ), {{\,\mathrm{\textrm{grad}}\,}}_W \textrm{D}_{\textrm{KL}}(\mu \Vert \nu ) \right) \\&\quad = \ - \int _{\mathcal {X}} \left| \nabla _x \log \frac{\mu (x)}{\nu (x)} \right| ^2 \mu (x) dx = - I(\mu \Vert \nu ). \end{aligned} \end{aligned}$$
(16)

Using the assumption, we have

$$\begin{aligned} \begin{aligned} \frac{d^2}{dt^2} \textrm{D}_{\textrm{KL}}(\mu _t\Vert \nu ) =&\ {{\,\mathrm{\textrm{Hess}}\,}}_{W} \textrm{D}_{\textrm{KL}}(\mu _t\Vert \nu )\left( {{\,\mathrm{\textrm{grad}}\,}}_W \textrm{D}_{\textrm{KL}}(\mu _t\Vert \nu ) , {{\,\mathrm{\textrm{grad}}\,}}_W \textrm{D}_{\textrm{KL}}(\mu _t\Vert \nu ) \right) \\ \ge&\ 2\alpha G_W(\mu _t) \left( {{\,\mathrm{\textrm{grad}}\,}}_W \textrm{D}_{\textrm{KL}}(\mu _t\Vert \nu ) , {{\,\mathrm{\textrm{grad}}\,}}_W \textrm{D}_{\textrm{KL}}(\mu _t\Vert \nu ) \right) \\ =&\ - 2\alpha \frac{d}{dt} \textrm{D}_{\textrm{KL}}(\mu _t\Vert \nu ), \end{aligned} \end{aligned}$$

from which LSI\((\alpha )\) follows by integrating the above formula, i.e.

$$\begin{aligned} \begin{aligned}&\ I(\mu _t\Vert \nu ) = I(\mu _t\Vert \nu ) - I(\nu \Vert \nu ) \\&\quad = \ \int _t^{\infty } \left( \frac{d^2}{d\tau ^2} \textrm{D}_{\textrm{KL}}(\mu _{\tau }\Vert \nu ) \right) d\tau \\&\quad \ge \ 2\alpha \int _t^{\infty } \left( -\frac{d}{d\tau } \textrm{D}_{\textrm{KL}}(\mu _{\tau }\Vert \nu ) \right) d\tau \\&\quad = \ 2\alpha \left( \textrm{D}_{\textrm{KL}}(\mu _t\Vert \nu ) - \textrm{D}_{\textrm{KL}}(\nu \Vert \nu ) \right) \\&\quad = \ 2\alpha \textrm{D}_{\textrm{KL}}(\mu _t\Vert \nu ), \end{aligned} \end{aligned}$$

where we use the fact that this gradient flow \(\mu _t\) converges to \(\nu \) and \(\textrm{D}_{\textrm{KL}}(\nu \Vert \nu ) = I(\nu \Vert \nu ) = 0\).

To prove the conclusion for Poincaré inequalities, we consider a path in the density manifold, i.e. \(\rho \left( \epsilon \right) = \nu \left( 1 + \epsilon f \right) \), \(\int _{\mathcal {X}}f(x) \nu (x) dx = 0\). Since we have

$$\begin{aligned} \begin{aligned} \textrm{D}_{\textrm{KL}}\left( \rho \left( \epsilon \right) \Vert \nu \right)&= \frac{\epsilon ^2}{2}\int _{\mathcal {X}}f^2(x) \nu (x) dx + o\left( \epsilon ^2 \right) , \\ - \frac{d}{dt} \textrm{D}_{\textrm{KL}}(\rho \left( \epsilon \right) \Vert \nu ) = I\left( \rho \left( \epsilon \right) \Vert \nu \right)&= \epsilon ^2\int _{\mathcal {X}} \left| \nabla _x f(x) \right| ^2 \nu (x) dx + o\left( \epsilon ^2 \right) . \end{aligned} \end{aligned}$$

Consequently, we obtain

$$\begin{aligned} \begin{aligned}&\ \frac{\int _{\mathcal {X}}f^2(x) \nu (x) dx}{\int _{\mathcal {X}} \left| \nabla _x f(x) \right| ^2 \nu (x) dx} \\&\quad = \ \frac{1}{2}\lim _{\epsilon \rightarrow 0} - \frac{\textrm{D}_{\textrm{KL}}\left( \rho \left( \epsilon \right) \Vert \nu \right) }{\frac{d}{d\epsilon } \textrm{D}_{\textrm{KL}}(\rho \left( \epsilon \right) \Vert \nu )} \\&\quad = \ \frac{1}{2}\lim _{\epsilon \rightarrow 0} - \frac{\frac{d}{d\epsilon }\textrm{D}_{\textrm{KL}}\left( \rho \left( \epsilon \right) \Vert \nu \right) }{\frac{d^2}{d\epsilon ^2} \textrm{D}_{\textrm{KL}}(\rho \left( \epsilon \right) \Vert \nu )} \\&\quad = \ \frac{1}{2}\lim _{\epsilon \rightarrow 0} \frac{G_W\left( \rho \left( 0 \right) \right) \left( \frac{d}{d\epsilon }\rho \left( 0 \right) , \frac{d}{d\epsilon }\rho \left( 0 \right) \right) }{{{\,\mathrm{\textrm{Hess}}\,}}_{W} \textrm{D}_{\textrm{KL}}(\rho \left( 0 \right) \Vert \nu )\left( \frac{d}{d\epsilon }\rho \left( 0 \right) , \frac{d}{d\epsilon }\rho \left( 0 \right) \right) } \\&\quad \le \ \frac{1}{\alpha }, \end{aligned} \end{aligned}$$

where we use L'Hôpital's rule in the second equality, and the last inequality holds because of the assumption that \({{\,\mathrm{\textrm{Hess}}\,}}_{W} \textrm{D}_{\textrm{KL}}(\nu \Vert \nu ) - 2\alpha G_W(\nu )\) is positive semi-definite. \(\square \)

Remark 14

With the help of (16), readers can recognize that the LSI guarantees global exponential convergence of the gradient flow of the relative entropy. Indeed, suppose \(\mu _t\) is a gradient flow of \(\textrm{D}_{\textrm{KL}}(\cdot \Vert \nu )\) starting from \(\mu _0\); then (16) and LSI\((\alpha )\) give \(\frac{d}{dt}\textrm{D}_{\textrm{KL}}(\mu _t\Vert \nu ) = -I(\mu _t\Vert \nu ) \le -2\alpha \textrm{D}_{\textrm{KL}}(\mu _t\Vert \nu )\), and Grönwall's inequality yields

$$\begin{aligned} \begin{aligned} \textrm{D}_{\textrm{KL}}(\mu _t\Vert \nu )&\le e^{-2\alpha t}\textrm{D}_{\textrm{KL}}(\mu _0\Vert \nu ),\qquad \mu _0 \in \mathcal {P}(\mathcal {X}) \text { (LSI}\left( \alpha \right) ) . \end{aligned} \end{aligned}$$

Intuitively speaking, a PI can be viewed as an infinitesimal version of an LSI: it concerns the dynamics in a neighborhood of the minimizer.

1.2 B.2. LSIs and PIs in families

It is now clear that PIs and LSIs are related to the density manifold. We next look for their counterparts in statistical models, i.e. submanifolds.

We now fix a model \(\Theta \subset \mathcal {P}(\mathcal {X})\) with metric given by \(G_W\). The relative entropy on the model is the restriction of the global functional to this family, and we further require the reference measure \(\nu \) to lie in the family, i.e. \(\nu = p_{\theta _*},\theta _* \in \Theta \). We use \(\widetilde{}\) to distinguish the constrained case (statistical models) from the global situation (density manifold). Recall that the Fisher information functional is the relative entropy dissipation along the gradient flow. Thus we have

$$\begin{aligned} \begin{aligned} \widetilde{I}(p_{\theta _t}|p_{\theta _*}) =&- \frac{d}{dt} \textrm{D}_{\textrm{KL}}(p_{\theta _t}\Vert p_{\theta _*})\\ =&\ g_W\left( {{\,\mathrm{\textrm{grad}}\,}}_W \textrm{D}_{\textrm{KL}}(p_\theta \Vert p_{\theta _*}), {{\,\mathrm{\textrm{grad}}\,}}_W \textrm{D}_{\textrm{KL}}(p_\theta \Vert p_{\theta _*}) \right) \\ =&\ \left( \nabla _{\theta } \textrm{D}_{\textrm{KL}} \right) ^T \left( \widetilde{G}_W^{-1} \right) ^T \widetilde{G}_W \widetilde{G}_W^{-1}\nabla _{\theta } \textrm{D}_{\textrm{KL}} \\ =&\ \left( \nabla _{\theta } \textrm{D}_{\textrm{KL}} \right) ^T \widetilde{G}_W^{-1}\nabla _{\theta } \textrm{D}_{\textrm{KL}}, \end{aligned} \end{aligned}$$
(17)

where we use a fact

$$\begin{aligned} {{\,\mathrm{\textrm{grad}}\,}}_W \textrm{D}_{\textrm{KL}}(p_\theta \Vert p_{\theta _*}) = \widetilde{G}_W^{-1}\nabla _{\theta } \textrm{D}_{\textrm{KL}}. \end{aligned}$$

Definition 20

(LSI in family) Consider a statistical model \(p: \mathcal {X} \times \Theta \rightarrow \mathbb {R}\). A probability measure \(p_{\theta _*}\) is said to satisfy a log-Sobolev inequality in \(\Theta \) with constant \(\alpha > 0\) (in short: LSI\((\alpha )\)) if we have

$$\begin{aligned} \textrm{D}_{\textrm{KL}}(p_\theta \Vert p_{\theta _*}) < \frac{1}{2\alpha }\widetilde{I}(p_\theta \Vert p_{\theta _*}), \qquad \theta \in \Theta . \end{aligned}$$

Using information matrices, we seek a sufficient condition for LSIs and PIs as in Proposition 19, i.e.

$$\begin{aligned} {{\,\mathrm{\textrm{Hess}}\,}}_{W} \textrm{D}_{\textrm{KL}}(p_\theta \Vert p_{\theta _*}) \ge 2\alpha \widetilde{G}_W(\theta ), \end{aligned}$$

where we note that the Hessian on the LHS is calculated in the submanifold instead of the density manifold. The Fisher information matrix also enters this picture, via a decomposition of the Hessian term \({{\,\mathrm{\textrm{Hess}}\,}}_{W} \textrm{D}_{\textrm{KL}}(p_\theta \Vert p_{\theta _*})\). This leads to the Ricci-information-Wasserstein (RIW) condition.

Theorem 21

(RIW-condition) The information matrix criterion for LSI\(\left( \alpha \right) \) of the distribution \(p_{\theta _*}\) is given by

$$\begin{aligned} G_F\left( \theta \right) + \nabla _{\theta }^2p_\theta \log \frac{p_\theta }{p_{\theta _*}} - \Gamma ^W \nabla _{\theta }\textrm{D}_{\textrm{KL}}(p_\theta \Vert p_{\theta _{*}}) \ge 2\alpha G_W(\theta ), \end{aligned}$$

where the \(\Gamma ^W\) are Christoffel symbols in the Wasserstein statistical model \(\Theta \), while the criterion for PI\(\left( \alpha \right) \) of the distribution \(p_{\theta _*}\) can be written as

$$\begin{aligned} G_F\left( \theta \right) + \nabla _{\theta }^2p_\theta \log \frac{p_\theta }{p_{\theta _*}} \ge 2\alpha G_W(\theta ). \end{aligned}$$

Remark 15

It can be seen that the condition for log-Sobolev inequalities is much more complicated than that for Poincaré inequalities, since LSIs require global convexity of the entropy while PIs only concern the local behavior at the minimum. The most significant difference appears in the Hessian term of the entropy, where the Wasserstein Christoffel symbols enter.

1.3 B.3. Examples in 1-d Family

Both LSIs and PIs can be established using Wasserstein and Fisher information matrices. Previously, we carried out geometric computations of the metric tensor and of the Hessian of the entropy. These are the ingredients needed to establish inequalities in families of probability distributions. In this section, we use the previous calculations to obtain concrete bounds in these functional inequalities.

Example 14

(Gaussian distribution) Recall that for a Gaussian distribution with mean \(\mu \) and standard deviation \(\sigma \), the Wasserstein and Fisher information matrices are given by

$$\begin{aligned} G_W(\mu ,\sigma ) = \begin{pmatrix} 1 &{} 0 \\ 0 &{} 1 \end{pmatrix}, \qquad G_F(\mu ,\sigma ) = \begin{pmatrix} \frac{1}{\sigma ^2} &{} 0 \\ 0 &{} \frac{2}{\sigma ^2} \end{pmatrix}. \end{aligned}$$

The entropy and the relative entropy defined on this model are provided by

$$\begin{aligned} \begin{aligned} H(\mu ,\sigma )&= - \frac{1}{2}\log 2\pi - \log \sigma - \frac{1}{2}, \\ \textrm{D}_{\textrm{KL}}(\mu ,\sigma \Vert p_*)&= - \log \sigma + \log \sigma _* - \frac{1}{2}+ \frac{\sigma ^2 + \left( \mu - \mu _* \right) ^2}{2\sigma _*^2},\qquad p_* \sim p_{\mu _*,\sigma _*}. \end{aligned} \end{aligned}$$

We can calculate Wasserstein gradients associated with these two functionals

$$\begin{aligned} \begin{aligned} \nabla _{\mu ,\sigma }^W H(\mu ,\sigma ) = \left( \begin{aligned}&0 \ \\ -&\frac{1}{\sigma } \ \end{aligned}\right) , \quad \nabla _{\mu ,\sigma }^W \textrm{D}_{\textrm{KL}}(\mu ,\sigma \Vert p_*) = \left( \begin{aligned}&\frac{\mu - \mu _*}{\sigma _*^2} \ \\ -&\frac{1}{\sigma } + \frac{\sigma }{\sigma _*^2} \ \end{aligned}\right) , \end{aligned} \end{aligned}$$

with the correspondent Fisher information functionals as

$$\begin{aligned} \begin{aligned} \widetilde{I}(\mu ,\sigma )&= \frac{1}{\sigma ^2}, \\ \widetilde{I}(\mu ,\sigma \Vert p_*)&= \frac{\left( \mu - \mu _* \right) ^2}{\sigma _*^4} + \left( - \frac{1}{\sigma } + \frac{\sigma }{\sigma _*^2} \right) ^2. \end{aligned} \end{aligned}$$

Thus, the LSI\(\left( \alpha \right) \) for Gaussian \(p_{\mu _*,\sigma _*}\) is given by

$$\begin{aligned} \begin{aligned} \frac{\left( \mu - \mu _* \right) ^2}{\sigma _*^4} + \left( - \frac{1}{\sigma } + \frac{\sigma }{\sigma _*^2} \right) ^2&\ge 2\alpha \left( - \log \sigma + \log \sigma _* - \frac{1}{2}+ \frac{\sigma ^2 + \left( \mu - \mu _* \right) ^2}{2\sigma _*^2} \right) . \end{aligned} \end{aligned}$$

Next, we move on to the derivation of the RIW condition. It suffices to consider the relation between \(\widetilde{G}_W\) and \({{\,\mathrm{\textrm{Hess}}\,}}_{W} \textrm{D}_{\textrm{KL}}\) at each point in a statistical model. Recall the formula for the Hessian in Riemannian geometry

$$\begin{aligned} \left( {{\,\mathrm{\textrm{Hess}}\,}}f \right) _{ij} = \partial _i\partial _j f - \Gamma _{ij}^{k(W)} \partial _k f, \end{aligned}$$

where the \(\Gamma ^{W}\) are Christoffel symbols in Wasserstein geometry. In the Wasserstein Gaussian model, where the metric is Euclidean, the Christoffel symbols vanish, i.e. \(\Gamma ^{W} = 0\). Thus we have

$$\begin{aligned} {{\,\mathrm{\textrm{Hess}}\,}}_{W} H\left( \mu , \sigma \right) = \begin{pmatrix} 0 &{} 0 \\ 0 &{} \frac{1}{\sigma ^2} \end{pmatrix}, \qquad {{\,\mathrm{\textrm{Hess}}\,}}_{W} \textrm{D}_{\textrm{KL}}\left( \mu , \sigma \Vert p_{*} \right) = \begin{pmatrix} \frac{1}{\sigma _*^2} &{} 0 \\ 0 &{} \frac{1}{\sigma ^2} + \frac{1}{\sigma _*^2} \end{pmatrix}. \end{aligned}$$

For a gradient flow of the relative entropy w.r.t. a Gaussian \(p_{\mu _*, \sigma _*}\), we conclude that

$$\begin{aligned} {{\,\mathrm{\textrm{Hess}}\,}}_{W} \textrm{D}_{\textrm{KL}}(\mu ,\sigma \Vert p_{\theta _*}) \ge \left( \frac{1}{\sigma _*^2} \right) G_W(\mu ,\sigma ), \end{aligned}$$

since \(G_W(\mu ,\sigma )\) is exactly the identity matrix. In other words, the Gaussian \(p_{\mu _*,\sigma _*}\) satisfies an LSI\(\left( \frac{1}{2\sigma _*^2}\right) \) in the Gaussian model. Notice that this result coincides with the one in the global case, which is a simple consequence of the fact that the Gaussian family is totally geodesic in the Wasserstein density manifold.

Next, for the gradient flow of the entropy \(H\left( \cdot \right) \), there is no constant \(\alpha \) such that the Hessian condition of Proposition 19 holds, since the matrix \({{\,\mathrm{\textrm{Hess}}\,}}_{W} H(\mu ,\sigma )\) has a zero eigenvalue. Despite this, we have

$$\begin{aligned} {{\,\mathrm{\textrm{grad}}\,}}_W H(\mu ,\sigma ) = G_W^{-1}\nabla _{\mu ,\sigma } H(\mu ,\sigma ) = \nabla _{\mu ,\sigma }H(\mu ,\sigma ), \end{aligned}$$

whose \(\mu \) component always vanishes. Thus the gradient direction of \(H(\cdot )\) always coincides with the \(\sigma \) direction, in which we have the eigenvalue bound \({{\,\mathrm{\textrm{eig}}\,}}_{\sigma }({{\,\mathrm{\textrm{Hess}}\,}}_W H) \ge \frac{1}{\sigma ^2} {{\,\mathrm{\textrm{eig}}\,}}_{\sigma }(G_W)\); that is, the eigenvalues of the two matrices corresponding to the direction \(\frac{\partial }{\partial \sigma }\) satisfy this bound. For LSIs, if the range of \(\sigma \) is unbounded, then no constant \(\alpha > 0\) can satisfy \(\frac{1}{\sigma ^2} \ge 2\alpha \) for all \(\sigma \), so LSI\((\alpha )\) cannot hold. However, if we restrict the range of \(\sigma \) to a bounded region such as \((0, M]\), then LSI\((\frac{1}{2M^2})\) will hold.

Remark 16

The above calculation on gradient flows of the entropy does not establish LSI\((\alpha )\) for any specific distribution; it merely provides an example of using the Hessian condition to study dynamical behaviors.

Example 15

(Laplacian distribution) Consider the case of Laplacian distribution, where

$$\begin{aligned} G_W(m, \lambda ) = \begin{pmatrix} 1 &{} 0 \\ 0 &{} \frac{2}{\lambda ^4} \end{pmatrix}, \quad G_F(m, \lambda ) = \begin{pmatrix} \lambda ^2 &{} 0 \\ 0 &{} \frac{1}{\lambda ^2} \end{pmatrix}, \end{aligned}$$

from which we can calculate the Christoffel symbol as

$$\begin{aligned} \begin{aligned} \Gamma _{22}^{2(W)}(m, \lambda )&= \frac{g^{-1}_{22}}{2} (\partial _2 g_{22} + \partial _2 g_{22} - \partial _2 g_{22}) = \frac{g^{-1}_{22}}{2} \partial _2 g_{22} = \frac{\lambda ^4}{4} \cdot \left( - \frac{8}{\lambda ^5} \right) = - \frac{2}{\lambda }, \\ \Gamma _{ij}^{k(W)}(m, \lambda )&= 0 \qquad \text {otherwise}. \end{aligned} \end{aligned}$$

Following the same procedure as before, the entropy and the relative entropy w.r.t. \(p_{m_*, \lambda _*}\) defined on this model are given by

$$\begin{aligned} \begin{aligned} H(m,\lambda )&= - 1 + \log \lambda - \log 2 , \\ \textrm{D}_{\textrm{KL}}(m,\lambda \Vert p_*)&= - 1 + \log \lambda - \log \lambda _* + \lambda _*\left| m - m_* \right| + \frac{\lambda _* e^{ - \lambda \left| m - m_* \right| }}{\lambda }, \end{aligned} \end{aligned}$$

from which we can calculate Wasserstein gradients associated with two functionals

$$\begin{aligned} \begin{aligned}&\nabla _{m,\lambda }^W H(m,\lambda ) = \left( \begin{aligned} \ {}&0 \ \\ \ {}&\frac{1}{\lambda } \ \end{aligned}\right) , \\&\nabla _{m,\lambda }^W \textrm{D}_{\textrm{KL}}(m,\lambda \Vert p_*) = \left\{ \begin{aligned}&\left( \begin{aligned}&\lambda _*\left( 1 - e^{ - \lambda \left( m - m_* \right) }\right) \\&- \frac{\left( \lambda \left( m - m_* \right) + 1 \right) \lambda _*e^{ - \lambda \left( m - m_* \right) } - \lambda }{\lambda ^2} \end{aligned}\right) , \ m > m_*, \\&\left( \begin{aligned}&- \lambda _*\left( 1 - e^{ - \lambda \left( m_* - m \right) }\right) \\&- \frac{\left( \lambda \left( m_* - m \right) + 1 \right) \lambda _*e^{ - \lambda \left( m_* - m \right) } - \lambda }{\lambda ^2} \end{aligned}\right) , \ m < m_*, \end{aligned}\right. \end{aligned} \end{aligned}$$

with the Fisher information functionals as

$$\begin{aligned} \begin{aligned} \widetilde{I}(m,\lambda )&= \frac{\lambda ^2}{2}, \\ \widetilde{I}(m,\lambda \Vert p_*)&= \lambda _*^2\left( 1 - e^{ - \lambda \left| m - m_* \right| }\right) ^2 + \frac{\left( \left( \lambda \left| m - m_* \right| + 1 \right) \lambda _*e^{ - \lambda \left| m - m_* \right| } - \lambda \right) ^2}{2}. \end{aligned} \end{aligned}$$

Notice that the value of \(\nabla _{m,\lambda }^W \textrm{D}_{\textrm{KL}}(m,\lambda \Vert p_*)\) is not well-defined at the point \(m = m_*\). However, the quantities we consider are integrals over the whole of \(\mathbb {R}\), so we can simply ignore its value at \(m = m_*\). As before, LSI\(\left( \alpha \right) \) is given by

$$\begin{aligned} \begin{aligned}&\lambda _*^2\left( 1 - e^{ - \lambda \left| m_* - m \right| }\right) ^2 + \frac{\left( \left( \lambda \left| m_* - m \right| + 1 \right) \lambda _*e^{ - \lambda \left| m_* - m \right| } - \lambda \right) ^2}{2} \\&\quad \ge \ 2\alpha \left( - 1 + \log \lambda - \log \lambda _* + \lambda _*\left| m - m_* \right| + \frac{\lambda _* e^{ - \lambda \left| m - m_* \right| }}{\lambda } \right) . \end{aligned} \end{aligned}$$

We find that the Hessians of the entropy and of the relative entropy in \((\Theta , G_W)\) are given by

$$\begin{aligned} \begin{aligned} {{\,\mathrm{\textrm{Hess}}\,}}_{W} H\left( m,\lambda \right)&= \begin{pmatrix} 0 &{} 0 \\ 0 &{} \frac{1}{\lambda ^2} \end{pmatrix}, \\ {{\,\mathrm{\textrm{Hess}}\,}}_{W} \textrm{D}_{\textrm{KL}}\left( m,\lambda \Vert p_* \right)&= \begin{pmatrix} \lambda \lambda _* e^{ - \lambda \left| m - m_* \right| } &{} 0 \\ 0 &{} \frac{1}{\lambda ^2} + \frac{\lambda _* e^{ - \lambda \left| m - m_* \right| }\left( m_* - m \right) ^2 }{\lambda } \end{pmatrix}. \end{aligned} \end{aligned}$$

Following the same analysis, we conclude that for gradient flows of the entropy \(H(m,\lambda )\), an LSI\((\frac{\lambda ^2}{4})\) holds. For the relative entropy \(\textrm{D}_{\textrm{KL}}(m,\lambda \Vert p_{m_*, \lambda _*})\), the Hessian condition can be written as

$$\begin{aligned} \begin{pmatrix} \lambda \lambda _* e^{ - \lambda \left| m - m_* \right| } &{} 0 \\ 0 &{} \frac{1}{\lambda ^2} + \frac{\lambda _* e^{ - \lambda \left| m - m_* \right| }\left( m_* - m \right) ^2 }{\lambda ^3} \end{pmatrix} \ge \alpha \begin{pmatrix} 1 &{} 0 \\ 0 &{} \frac{2}{\lambda ^4} \end{pmatrix}, \end{aligned}$$

which can be reformulated as

$$\begin{aligned} \begin{aligned} \alpha = \min _{m,\lambda }\left\{ \lambda \lambda _* e^{ - \lambda \left| m - m_* \right| }, \frac{1}{2}\left( \lambda ^2 + \lambda _* \lambda ^3 e^{ - \lambda \left| m - m_* \right| }\left( m_* - m \right) ^2 \right) \right\} . \end{aligned} \end{aligned}$$

From the above formula, we conclude that in order to obtain a positive constant, it suffices to restrict the parameters to a region \(m \in \left[ -M, M \right] \), \(\lambda \in \left[ N_1, N_2 \right] \) with \(0< N_1 \le N_2 < \infty \). The distribution La\((m_*, \lambda _*)\) then satisfies an LSI\((\alpha )\) in the Laplacian family with \(\alpha \) given above.
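As a numerical sanity check on the \((\lambda ,\lambda )\) entry of \({{\,\mathrm{\textrm{Hess}}\,}}_{W} \textrm{D}_{\textrm{KL}}\) above (an added illustration with arbitrary parameter values), one can compare the finite-difference value of \(\frac{\partial ^2}{\partial \lambda ^2}\textrm{D}_{\textrm{KL}} - \Gamma ^{2(W)}_{22}\frac{\partial }{\partial \lambda }\textrm{D}_{\textrm{KL}}\) with the closed form:

import numpy as np

m, lam = 0.3, 1.7        # evaluation point (arbitrary)
m_s, lam_s = -0.2, 2.5   # reference parameters (m_*, lambda_*)
h = 1e-4

def D(m, lam):
    # KL divergence between La(m, lam) and La(m_s, lam_s), as in the text
    d = abs(m - m_s)
    return -1.0 + np.log(lam) - np.log(lam_s) + lam_s * d + lam_s * np.exp(-lam * d) / lam

dD = (D(m, lam + h) - D(m, lam - h)) / (2 * h)
d2D = (D(m, lam + h) - 2 * D(m, lam) + D(m, lam - h)) / h**2

# Hess_22 = d^2 D / d lam^2 + (2 / lam) * dD / d lam, since Gamma^{2(W)}_{22} = -2/lam
hess_22 = d2D + (2.0 / lam) * dD
d = abs(m - m_s)
closed_form = 1.0 / lam**2 + lam_s * np.exp(-lam * d) * d**2 / lam
print(hess_22, closed_form)  # the two values should agree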

Example 16

(Independent model) For an independent family \(p\left( x, y; \theta \right) = p_1\left( x; \theta \right) p_2\left( y; \theta \right) \), we have

$$\begin{aligned} G_W = G_W^1 + G_W^2, \qquad G_F = G_F^1 + G_F^2. \end{aligned}$$

The entropy and the relative entropy also have this separability property

$$\begin{aligned} \begin{aligned} H\left( \theta \right)&= H_1\left( \theta \right) + H_2\left( \theta \right) , \\ \textrm{D}_{\textrm{KL}}\left( \theta \Vert p_*\right)&= \textrm{D}_{\textrm{KL}}^1\left( \theta \Vert p_{1*}\right) + \textrm{D}_{\textrm{KL}}^2\left( \theta \Vert p_{2*} \right) , \\ \nabla _{\theta } \textrm{D}_{\textrm{KL}}\left( \theta \Vert p_*\right)&= \nabla _{\theta } \textrm{D}_{\textrm{KL}}^1\left( \theta \Vert p_{1*}\right) + \nabla _{\theta }\textrm{D}_{\textrm{KL}}^2\left( \theta \Vert p_{2*} \right) . \end{aligned} \end{aligned}$$

The Fisher information functional is given by

$$\begin{aligned} \begin{aligned}&\ I\left( p_{\theta }| p_* \right) \\&\quad = \ \left( \nabla _{\theta }\textrm{D}_{\textrm{KL}}^1\left( \theta \Vert p_{1*}\right) + \nabla _{\theta }\textrm{D}_{\textrm{KL}}^2\left( \theta \Vert p_{2*} \right) \right) ^T \left( G_W^1 + G_W^2 \right) ^{-1} \\&\quad \left( \nabla _{\theta }\textrm{D}_{\textrm{KL}}^1\left( \theta \Vert p_{1*}\right) + \nabla _{\theta }\textrm{D}_{\textrm{KL}}^2\left( \theta \Vert p_{2*} \right) \right) , \end{aligned} \end{aligned}$$

with LSI\(\left( \alpha \right) \) given by

$$\begin{aligned} \begin{aligned}&\ \left( \nabla _{\theta }\textrm{D}_{\textrm{KL}}^1\left( \theta \Vert p_{1*}\right) + \nabla _{\theta }\textrm{D}_{\textrm{KL}}^2\left( \theta \Vert p_{2*} \right) \right) ^T \left( G_W^1 + G_W^2 \right) ^{-1} \left( \nabla _{\theta }\textrm{D}_{\textrm{KL}}^1\left( \theta \Vert p_{1*}\right) + \nabla _{\theta }\textrm{D}_{\textrm{KL}}^2\left( \theta \Vert p_{2*} \right) \right) \\&\quad \ge \ 2\alpha \left( \textrm{D}_{\textrm{KL}}^1\left( \theta \Vert p_{1*}\right) + \textrm{D}_{\textrm{KL}}^2\left( \theta \Vert p_{2*} \right) \right) . \end{aligned} \end{aligned}$$

In conclusion, the above examples provide another way to prove functional inequalities as well as convergence rates of dynamics in probability families.

Appendix C. Proofs in Sect. 4

1.1 C.1. Proof of Theorem 13

Proof of Theorem 13

Throughout, \(\nabla _x\) refers to the gradient w.r.t. the \(x\) variable, while \(\nabla _{\theta }\) refers to the gradient w.r.t. the \(\theta \) variable. We expand the function \(l(x_t,\theta _t)\) as

$$\begin{aligned} l(x_t,\theta _t) = l(x_t,\theta _*) + \nabla _{\theta }l(x_t,\theta _*)\left( \theta _t - \theta _* \right) + O\left( \left| \theta _t - \theta _* \right| ^2 \right) . \end{aligned}$$

Subtracting \(\theta _*\) from both sides of the updating equation and plugging in the expansion above, we get

$$\begin{aligned} \begin{aligned} \theta _{t+1} - \theta _* =&\ \left( \theta _{t} - \theta _* \right) - \frac{1}{t}G_W^{-1}(\theta _t)\left( l(x_t,\theta _*) + \nabla _{\theta }l(x_t,\theta _*)\left( \theta _t - \theta _* \right) \right. \\&\ + \left. O\left( \left| \theta _t - \theta _* \right| ^2 \right) \right) . \end{aligned} \end{aligned}$$

Then, taking Wasserstein covariances of both sides, we get

$$\begin{aligned} \begin{aligned} V_{t+1} = V_t&+ \frac{1}{t^2}\mathbb {E}_{p_{\theta _*}} \left[ \nabla _x \left( G_W^{-1}(\theta _t) l(x_t,\theta _t) \right) \cdot \nabla _x \left( l(x_t,\theta _t)^{T} G_W^{-1}(\theta _t) \right) \right] \\&- \frac{2}{t} \mathbb {E}_{p_{\theta _*}} \left[ \nabla _x \left( \theta _t - \theta _* \right) \cdot \nabla _x \left( l(x_t,\theta _*)^{T} G_W^{-1}(\theta _t) \right) \right] + o\left( \frac{V_t}{t} \right) \\&- \frac{2}{t} \mathbb {E}_{p_{\theta _*}} \left[ \nabla _x \left( \theta _t - \theta _* \right) \cdot \nabla _x \left( \left( \theta _t - \theta _* \right) ^T\nabla _{\theta }l(x_t,\theta _*)^T G_W^{-1} (\theta _t) \right) \right] , \end{aligned} \end{aligned}$$

where the last term corresponds to the expansion term \(O\left( \left| \theta _t - \theta _* \right| ^2 \right) \) and we use the assumption that \(\mathbb {E}_{p_{\theta _*}} \left[ \left( \theta _t - \theta _* \right) ^2 \right] , \mathbb {E}_{p_{\theta _*}} \left[ \left| \nabla _x \left( \theta _t - \theta _* \right) \right| ^2 \right] = o(1)\). In the above formula, we drop the transpose symbols \({}^T\) on the metric tensor \(G_W\) because of its symmetry. For the second term on the RHS, we have

$$\begin{aligned} \begin{aligned}&\ \frac{1}{t^2}\mathbb {E}_{p_{\theta _*}} \left[ \nabla _x \left( G_W^{-1}(\theta _t) l(x_t,\theta _t) \right) \cdot \nabla _x \left( l(x_t,\theta _t)^{T} G_W^{-1}(\theta _t) \right) \right] \\&\quad = \ \frac{1}{t^2}\mathbb {E}_{p_{\theta _*}} \left[ G_W^{-1}(\theta _*)\nabla _x \left( l(x_t,\theta _*) \right) \cdot \nabla _x \left( l(x_t,\theta _*)^{T} \right) G_W^{-1}(\theta _*) \right] + o \left( \frac{1}{t^2} \right) \\&\quad = \ \frac{1}{t^2} G_W^{-1}(\theta _*) \mathbb {E}_{p_{\theta _*}} \left[ \nabla _x \left( l(x_t,\theta _*) \right) \cdot \nabla _x \left( l(x_t,\theta _*)^{T} \right) \right] G_W^{-1}(\theta _*) + o \left( \frac{1}{t^2} \right) , \end{aligned} \end{aligned}$$

where we use the following fact

$$\begin{aligned} \begin{aligned}&\ \mathbb {E}_{p_{\theta _*}} \left[ \nabla _x \left( G_W^{-1}(\theta _t) l(x_t,\theta _t) \right) \cdot \nabla _x \left( l(x_t,\theta _t)^{T} G_W^{-1}(\theta _t) \right) \right] \\&\qquad \ - \mathbb {E}_{p_{\theta _*}} \left[ G_W^{-1}(\theta _*)\nabla _x \left( l(x_t,\theta _*) \right) \cdot \nabla _x \left( l(x_t,\theta _*)^{T} \right) G_W^{-1}(\theta _*) \right] \\&\quad = \ O\left( \mathbb {E}_{p_{\theta _*}} \left[ \left| \theta _t - \theta _* \right| \right] \right) = o\left( 1 \right) . \end{aligned} \end{aligned}$$

And the third term on the RHS can be reduced as

$$\begin{aligned} \begin{aligned}&- \frac{2}{t} \mathbb {E}_{p_{\theta _*}}\left[ \nabla _x \left( \theta _t - \theta _* \right) \cdot \nabla _x \left( l(x_t,\theta _*)^{T} G_W^{-1}(\theta _t) \right) \right] \\&\quad = - \frac{2}{t} \mathbb {E}_{p_{\theta _*}} \left[ \nabla _x \left( \theta _t - \theta _* \right) \cdot \nabla _x \left( l(x_t,\theta _*)^{T} \right) G_W^{-1}(\theta _t) \right] \\&\qquad - \frac{2}{t} \mathbb {E}_{p_{\theta _*}} \left[ \nabla _x \left( \theta _t - \theta _* \right) \cdot l(x_t,\theta _*)^{T} \nabla _x G_W^{-1}(\theta _t) \right] \\&\quad = \ 0, \end{aligned} \end{aligned}$$

where the first term vanishes because \(\nabla _x \left( \theta _t - \theta _* \right) \) only has non-vanishing components at \(x_1,...,x_{t-1}\) while \(\nabla _x \left( l(x_t,\theta _*)^{T} \right) \) only has a non-vanishing component at \(x_t\); consequently their inner product vanishes everywhere. The second term also vanishes: considering each element of this matrix, we have

$$\begin{aligned} \begin{aligned}&\left( \mathbb {E}_{p_{\theta _*}} \left[ \nabla _x \left( \theta _t - \theta _* \right) \cdot l(x_t,\theta _*)^{T} \nabla _x G_W^{-1}(\theta _t) \right] \right) _{ij} \\&\quad = \ \frac{2}{t}\mathbb {E}_{p_{\theta _*}}\nabla _x \left( \theta _t - \theta _* \right) _i \cdot \left( l(x_t,\theta _*)^{T} \nabla _x G_W^{-1}(\theta _t) \right) _j \\&\quad = \ \frac{2}{t}\mathbb {E}_{p_{\theta _*}} \left[ \nabla _x \left( \theta _t - \theta _* \right) _i \cdot \nabla _x \left( G_W^{-1}(\theta _t) _{kj} \right) l(x_t,\theta _*)^{T}_k \right] \\&\quad = \ \frac{2}{t}\mathbb {E}_{p_{\theta _*}}\left[ \nabla _x \left( \theta _t - \theta _* \right) _i \cdot \nabla _x \left( G_W^{-1}(\theta _t) _{kj} \right) \right] \mathbb {E}_{p_{\theta _*}}l(x_t,\theta _*)^{T}_k \\&\quad = \ 0, \end{aligned} \end{aligned}$$

where the third equality is guaranteed by the fact that \(\theta _t - \theta _{*}\) is independent of \(l(x_t,\theta _*)\), since \(\theta _t\) and \(x_t\) are mutually independent. The last equality holds by the assumption that

$$\begin{aligned} \mathbb {E}_{p_{\theta _*}}l(x_t,\theta _*) = 0. \end{aligned}$$

For the last term, following the same analysis as for the third term, we find

$$\begin{aligned} \begin{aligned}&- \frac{2}{t} \mathbb {E}_{p_{\theta _*}} \left[ \nabla _x \left( \theta _t - \theta _* \right) \cdot \nabla _x \left( \left( \theta _t - \theta _* \right) ^T\nabla _{\theta }l(x_t,\theta _*)^T G_W^{-1} (\theta _t) \right) \right] \\&\quad = - \frac{2}{t} \mathbb {E}_{p_{\theta _*}} \left[ \nabla _x \left( \theta _t - \theta _* \right) \cdot \nabla _x \left( \left( \theta _t - \theta _* \right) ^T \right) \nabla _{\theta }l(x_t,\theta _*)^T G_W^{-1} (\theta _t) \right] \\&\qquad - \frac{2}{t} \mathbb {E}_{p_{\theta _*}} \left[ \nabla _x \left( \theta _t - \theta _* \right) \cdot \left( \theta _t - \theta _* \right) ^T \nabla _x \left( \nabla _{\theta }l(x_t,\theta _*)^T \right) G_W^{-1} (\theta _t) \right] \\&\qquad - \frac{2}{t} \mathbb {E}_{p_{\theta _*}} \left[ \nabla _x \left( \theta _t - \theta _* \right) \cdot \left( \theta _t - \theta _* \right) ^T \nabla _{\theta }l(x_t,\theta _*)^T \nabla _x \left( G_W^{-1} (\theta _t) \right) \right] \\&\quad = - \frac{2}{t} \mathbb {E}_{p_{\theta _*}} \left[ \nabla _x \left( \theta _t - \theta _* \right) \cdot \nabla _x \left( \left( \theta _t - \theta _* \right) ^T \right) \nabla _{\theta }l(x_t,\theta _*)^T G_W^{-1} (\theta _*) \right] + o\left( \frac{V_t}{t}\right) \\&\quad - \frac{2}{t} \mathbb {E}_{p_{\theta _*}} \left[ \nabla _x \left( \theta _t - \theta _* \right) \cdot \left( \theta _t - \theta _* \right) ^T \nabla _{\theta }l(x_t,\theta _*)^T \nabla _x \left( G_W^{-1} (\theta _t) \right) \right] , \end{aligned} \end{aligned}$$

where we again use the independence of \(\left( \theta _t - \theta _* \right) \) and \(l(x_t,\theta _*)\). The additional term appearing above can be further reduced, with the help of \(\mathbb {E}_{p_{\theta _*}} \left[ \nabla _{\theta }l(x_t,\theta _*) \right] = O(1)\) and \(\nabla _x \left( G_W^{-1} (\theta _t) \right) = \nabla _x \theta _t \nabla _{\theta } \left( G_W^{-1} (\theta _t) \right) = O(\nabla _x \left( \theta _t - \theta _* \right) )\), to the following bound:

$$\begin{aligned} \begin{aligned}&\ \frac{2}{t} \mathbb {E}_{p_{\theta _*}} \left[ \nabla _x \left( \theta _t - \theta _* \right) \cdot \left( \theta _t - \theta _* \right) ^T \nabla _{\theta }l(x_t,\theta _*) \nabla _x \left( \left( G_W^{-1} (\theta _t) \right) \right) \right] \\&\quad = \ \frac{2}{t} \mathbb {E}_{p_{\theta _*}} \left[ \nabla _x \left( \theta _t - \theta _* \right) \cdot \left( \theta _t - \theta _* \right) ^T O(1) \nabla _x \left( \theta _t - \theta _* \right) O(1) \right] \\&\quad \le \ \frac{O(1)}{t} \sqrt{\mathbb {E}_{p_{\theta _*}} \left[ \left| \nabla _x \left( \theta _t - \theta _* \right) \right| ^2 \right] \mathbb {E}_{p_{\theta _*}} \left[ \left( \theta _t - \theta _* \right) ^2 \right] \mathbb {E}_{p_{\theta _*}} \left[ \left| \nabla _x \left( \theta _t - \theta _* \right) \right| ^2 \right] } \\&\quad = \ o\left( \frac{V_t}{t} \right) . \end{aligned} \end{aligned}$$

Finally, the remaining term reduces to

$$\begin{aligned} \begin{aligned}&- \frac{2}{t} \mathbb {E}_{p_{\theta _*}} \left[ \nabla _x \left( \theta _t - \theta _* \right) \cdot \nabla _x \left( \left( \theta _t - \theta _* \right) ^T \right) \nabla _{\theta }l(x_t,\theta _*) \left( G_W^{-1} (\theta _*) \right) \right] + o\left( \frac{V_t}{t}\right) \\&\quad = - \frac{2V_t}{t} \mathbb {E}_{p_{\theta _*}} \left[ \nabla _{\theta }l(x_t,\theta _*) \right] G_W^{-1} (\theta _*) + o\left( \frac{V_t}{t}\right) . \end{aligned} \end{aligned}$$

Combining all the terms above, we derive the following update equation for the Wasserstein covariance along the natural gradient descent:

$$\begin{aligned} \begin{aligned} V_{t+1} =&\ V_t + \frac{1}{t^2} G_W^{-1}(\theta _*) \mathbb {E}_{p_{\theta _*}} \left[ \nabla _x \left( l(x_t,\theta _*) \right) \cdot \nabla _x \left( l(x_t,\theta _*)^{T} \right) \right] \left( G_W^{-1}(\theta _*) \right) \\&\quad - \frac{2V_t}{t} \mathbb {E}_{p_{\theta _*}} \left[ \nabla _{\theta }l(x_t,\theta _*) \right] G_W^{-1} (\theta _*) + o\left( \frac{V_t}{t}\right) + o\left( \frac{1}{t^2} \right) + O\left( \frac{V_t}{t^2}\right) . \end{aligned} \end{aligned}$$
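To see the rate implied by this recursion, the following minimal numerical sketch iterates only its leading-order part, \(V_{t+1} \approx V_t + \frac{1}{t^2}G_W^{-1}\,\mathfrak {I}\,G_W^{-1} - \frac{2}{t}V_t B\) with \(B = \mathbb {E}_{p_{\theta _*}}[\nabla _{\theta }l]\,G_W^{-1}\). The matrices below are hypothetical stand-ins chosen only for illustration, not quantities taken from the proof.

```python
import numpy as np

# Leading-order form of the covariance update, with the o(V_t/t), o(1/t^2),
# and O(V_t/t^2) remainders dropped.  All matrices are illustrative stand-ins.
G_W_inv = np.eye(2)              # G_W^{-1}(theta_*)
I_mat   = np.diag([1.0, 4.0])    # E[ grad_x l . grad_x l^T ]
B       = np.diag([1.0, 2.0])    # E[ grad_theta l ] G_W^{-1}(theta_*)

V = np.eye(2)                    # arbitrary initial covariance
T = 100_000
for t in range(1, T + 1):
    V = V + (G_W_inv @ I_mat @ G_W_inv) / t**2 - (2.0 / t) * V @ B

# When the eigenvalues of B exceed 1/2, t * V_t approaches (2B - Id)^{-1} C
# with C = G_W_inv @ I_mat @ G_W_inv; here that limit is diag(1, 4/3).
print(T * np.diag(V))
```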

Remark 17

The most frequently used tool in this proof is a separability property, cf. Proposition 5. The key observation is that two statistics \(T_1,T_2\) depending on disjoint sets of (independent) variables, such as \(T_1 = T_1(x_1,\dots ,x_{t-1})\) and \(T_2 = T_2(x_t,\dots ,x_{t+n})\), are “orthogonal” in both the Wasserstein and the Fisher metrics. Specifically, since the gradients of \(T_1,T_2\) with respect to \(x\) are supported on different variables, we have

$$\begin{aligned} {{\,\mathrm{\textrm{Cov}}\,}}^W\left[ T_1, T_2 \right] = \mathbb {E}_{p_{\theta _*}}\left[ \nabla _x T_1 \cdot \nabla _x T_2 \right] = 0. \end{aligned}$$

This type of separability is a direct analog of the one in Fisher-Rao geometry, i.e.

$$\begin{aligned} {{\,\mathrm{\textrm{Cov}}\,}}^F\left[ T_1, T_2 \right] = \mathbb {E}_{p_{\theta _*}}\left[ T_1 T_2 \right] = \mathbb {E}_{p_{\theta _*}}\left[ T_1 \right] \cdot \mathbb {E}_{p_{\theta _*}}\left[ T_2 \right] = 0. \end{aligned}$$
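As a minimal illustration of this separability (an informal sketch, not part of the proof), the Wasserstein covariance integrand \(\nabla _x T_1 \cdot \nabla _x T_2\) vanishes pointwise whenever the two statistics depend on disjoint coordinates:

```python
import numpy as np

rng = np.random.default_rng(0)
t = 6
x = rng.normal(size=t)                    # one illustrative sample x_1, ..., x_t

# T1 = mean(x_1, ..., x_{t-1}) depends only on the first t-1 coordinates,
# T2 = x_t**2 depends only on the last one.
grad_T1 = np.zeros(t); grad_T1[: t - 1] = 1.0 / (t - 1)
grad_T2 = np.zeros(t); grad_T2[t - 1] = 2.0 * x[t - 1]

# Disjoint supports: the integrand of Cov^W[T1, T2] is identically zero,
# so the expectation vanishes without any cancellation.
print(np.dot(grad_T1, grad_T2))           # 0.0
```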

C.2 Examples and numerical experiments of two efficiencies

Example 17

(Gaussian distribution) Consider the Gaussian distribution with mean \(\mu \) and standard deviation \(\sigma \), given by

$$\begin{aligned} p(x;\mu ,\sigma )=\frac{1}{\sqrt{2\pi } \sigma }e^{-\frac{1}{2\sigma ^2}(x-\mu )^2}. \end{aligned}$$

The WIM satisfies

$$\begin{aligned} G_W(\mu ,\sigma )=\begin{pmatrix} 1 &{} 0 \\ 0 &{} 1 \end{pmatrix}. \end{aligned}$$

The Fisher information matrix satisfies

$$\begin{aligned} G_F(\mu ,\sigma ) = \begin{pmatrix} \frac{1}{\sigma ^2} &{} 0 \\ 0 &{} \frac{2}{\sigma ^2} \end{pmatrix}. \end{aligned}$$

Further, the matrix \(G_FG_W^{-1}\) is given by

$$\begin{aligned} G_F(\mu ,\sigma )G_W^{-1}(\mu ,\sigma ) = \begin{pmatrix} \frac{1}{\sigma ^2} &{} 0 \\ 0 &{} \frac{2}{\sigma ^2} \end{pmatrix}. \end{aligned}$$
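The following Monte Carlo sketch checks the matrices above numerically. The Wasserstein score gradients used for \(G_W\), namely \(\nabla _x\Phi ^W_\mu = 1\) and \(\nabla _x\Phi ^W_\sigma = (x-\mu )/\sigma \), are an assumption taken from the location-scale transport map \(x = \mu + \sigma z\) and are not restated in this example; the Fisher scores are the standard Gaussian ones recalled below.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.3, 0.7
x = rng.normal(mu, sigma, size=1_000_000)

# Wasserstein information matrix: G_W[i, j] = E[ grad_x Phi_i * grad_x Phi_j ],
# using the (assumed) location-scale score gradients 1 and (x - mu)/sigma.
w_grads = np.stack([np.ones_like(x), (x - mu) / sigma])
print(w_grads @ w_grads.T / x.size)        # ~ identity

# Fisher information matrix from the classical Gaussian scores.
f_scores = np.stack([(x - mu) / sigma**2,
                     (x - mu)**2 / sigma**3 - 1.0 / sigma])
print(f_scores @ f_scores.T / x.size)      # ~ diag(1/sigma^2, 2/sigma^2)
```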

The optimal parameters are denoted by \(\mu _*, \sigma _*\). Thus we have the following conclusions on the efficiency of the Fisher natural gradient, the Wasserstein natural gradient, and the Wasserstein natural gradient on the Fisher score (maximum likelihood estimator).

The Wasserstein natural gradient is asymptotically efficient with an asymptotic Wasserstein covariance given by

$$\begin{aligned} V_t = \frac{1}{t}\begin{pmatrix} 1 &{} 0 \\ 0 &{} 1 \end{pmatrix} + O\left( \frac{1}{t^2} \right) . \end{aligned}$$

The Fisher natural gradient is asymptotically efficient, with an asymptotic classical covariance given by

$$\begin{aligned} V_t = \frac{1}{t}\begin{pmatrix} \sigma _*^2 &{} 0 \\ 0 &{} \frac{\sigma _*^2}{2} \end{pmatrix} + O\left( \frac{1}{t^2} \right) . \end{aligned}$$

Interestingly, the covariance matrix appearing in the Wasserstein efficiency is independent of the optimal value, while in the Fisher case the asymptotic behavior depends strongly on the optimal parameter.

For the last case, note that in the Gaussian family the two metric tensors \(G_F,G_W\) can be simultaneously diagonalized, so the situation is even simpler. We denote the smallest eigenvalue of \(G_F G_W^{-1}\) by \(\alpha \), i.e.

$$\begin{aligned} \alpha = \frac{1}{\sigma _*^2}. \end{aligned}$$

Furthermore, we need to compute the term

$$\begin{aligned} \mathbb {E}_{p_{\mu _*,\sigma _*}} \left[ \nabla _x \left( \nabla _{\mu _*,\sigma _*}l(x_t,\mu _*,\sigma _*) \right) \cdot \nabla _x \left( \nabla _{\mu _*,\sigma _*}l(x_t,\mu _*,\sigma _*)^{T} \right) \right] , \end{aligned}$$

that appears in the final result. In the Gaussian family, the Fisher scores \(\nabla _{\mu ,\sigma } l(x;\mu ,\sigma ) = \Phi ^F(x;\mu ,\sigma )\) are given by

$$\begin{aligned} \begin{aligned} \Phi _\mu ^F(x;\mu ,\sigma ) = \frac{x - \mu }{\sigma ^2}, \quad \Phi _\sigma ^F(x;\mu ,\sigma ) = \frac{(x-\mu )^2}{\sigma ^3} - \frac{1}{\sigma }. \end{aligned} \end{aligned}$$

By direct calculation, we have

$$\begin{aligned} \begin{aligned} \mathbb {E}_{p_{\mu _*,\sigma _*}} \left[ \nabla _x \Phi _\mu ^F(x;\mu _*,\sigma _*) \cdot \nabla _x \left( \Phi _\mu ^F(x;\mu _*,\sigma _*)^{T} \right) \right]&= \mathbb {E}_{p_{\mu _*,\sigma _*}}\left[ \frac{1}{\sigma _*^4} \right] = \frac{1}{\sigma _*^4}, \\ \mathbb {E}_{p_{\mu _*,\sigma _*}} \left[ \nabla _x \Phi _\mu ^F(x;\mu _*,\sigma _*) \cdot \nabla _x \left( \Phi _{\sigma }^F(x;\mu _*,\sigma _*)^{T} \right) \right]&= \mathbb {E}_{p_{\mu _*,\sigma _*}}\left[ \frac{1}{\sigma _*^2} \cdot \frac{2(x - \mu _*)}{\sigma _*^3} \right] \\&= 0, \\ \mathbb {E}_{p_{\mu _*,\sigma _*}} \left[ \nabla _x \Phi _{\sigma }^F(x;\mu _*,\sigma _*) \cdot \nabla _x \left( \Phi _{\sigma }^F(x;\mu _*,\sigma _*)^{T} \right) \right]&= \mathbb {E}_{p_{\mu _*,\sigma _*}}\left[ \frac{4\left( x - \mu _* \right) ^2}{\sigma _*^6} \right] \\&= \frac{4}{\sigma _*^4}, \end{aligned} \end{aligned}$$

and we conclude that the middle term is given by

$$\begin{aligned} \mathfrak {I} = \mathbb {E}_{p_{\mu _*,\sigma _*}} \left[ \nabla _x \left( \nabla _{\mu _*,\sigma _*}l(x_t,\mu _*,\sigma _*) \right) \cdot \nabla _x \left( \nabla _{\mu _*,\sigma _*}l(x_t,\mu _*,\sigma _*)^{T} \right) \right] = \begin{pmatrix} \frac{1}{\sigma _*^4} &{} 0 \\ 0 &{} \frac{4}{\sigma _*^4} \end{pmatrix}. \end{aligned}$$
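As a quick numerical check of \(\mathfrak {I}\) (a Monte Carlo sketch under the same Gaussian model, with arbitrarily chosen parameter values), one can average the outer products of the spatial gradients \(\nabla _x\Phi ^F_\mu = 1/\sigma ^2\) and \(\nabla _x\Phi ^F_\sigma = 2(x-\mu )/\sigma ^3\):

```python
import numpy as np

rng = np.random.default_rng(1)
mu_s, sigma_s = 0.0, 0.8
x = rng.normal(mu_s, sigma_s, size=1_000_000)

# Spatial gradients of the Fisher scores given above.
grads = np.stack([np.full_like(x, 1.0 / sigma_s**2),
                  2.0 * (x - mu_s) / sigma_s**3])
print(grads @ grads.T / x.size)                   # Monte Carlo estimate of I
print(np.diag([1 / sigma_s**4, 4 / sigma_s**4]))  # closed form diag(1/s^4, 4/s^4)
```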

When \(\frac{2}{\sigma _*^2} > 1\), the inverse of \(2B - \textbf{I}\), where \(B = G_FG_W^{-1}\), is given by

$$\begin{aligned} \left( 2B - \textbf{I}\right) ^{-1} = \begin{pmatrix} \frac{\sigma _*^2}{2 - \sigma _*^2} &{} 0 \\ 0 &{} \frac{\sigma _*^2}{4 - \sigma _*^2} \end{pmatrix}. \end{aligned}$$

Consequently, the term appearing in the asymptotic behavior of the Poincaré efficiency is given by

$$\begin{aligned} \begin{aligned} \left( 2G_FG_W^{-1} - \textbf{I}\right) ^{-1}G_W^{-1}(\theta _*) \mathfrak {I} \left( G_W^{-1}(\theta _*) \right) =&\begin{pmatrix} \frac{\sigma _*^2}{2 - \sigma _*^2} &{} 0 \\ 0 &{} \frac{\sigma _*^2}{4 - \sigma _*^2} \end{pmatrix}\begin{pmatrix} \frac{1}{\sigma _*^4} &{} 0 \\ 0 &{} \frac{4}{\sigma _*^4} \end{pmatrix} \\ =&\begin{pmatrix} \frac{1}{\left( 2 - \sigma _*^2 \right) \sigma _*^2} &{} 0 \\ 0 &{} \frac{4}{\left( 4 - \sigma _*^2 \right) \sigma _*^2} \end{pmatrix}. \end{aligned} \end{aligned}$$

Thus the asymptotic behavior of the Wasserstein covariance in the Wasserstein natural gradient of Fisher scores is given by

$$\begin{aligned} V_t = \left\{ \begin{array}{ll} O\left( t^{-\frac{2}{\sigma _*^2}} \right) , &{} \frac{1}{\sigma _*^2} \le \frac{1}{2}, \\ \frac{1}{t}\begin{pmatrix} \frac{1}{\left( 2 - \sigma _*^2 \right) \sigma _*^2} &{} 0 \\ 0 &{} \frac{4}{\left( 4 - \sigma _*^2 \right) \sigma _*^2} \end{pmatrix} + O\left( \frac{1}{t^2}\right) ,&\frac{1}{\sigma _*^2} > \frac{1}{2}. \end{array}\right. \end{aligned}$$
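To illustrate the two regimes, the following sketch simulates the online update \(\theta _{t+1} = \theta _t + \frac{1}{t}G_W^{-1}(\theta _t)\Phi ^F(x_t;\theta _t)\), using \(G_W = \textbf{I}\) in the Gaussian family; the step offset, sample sizes, and starting point are arbitrary choices for the illustration, not prescriptions from the text. Comparing the mean-squared error in \(\mu \) at horizons \(T\) and \(2T\) separates the \(1/t\) rate from the slower \(t^{-2/\sigma _*^2}\) rate.

```python
import numpy as np

# Minimal sketch of the online Wasserstein natural gradient on the Fisher
# score for a Gaussian model (G_W = I).  The offset "+ 10" in the step size
# only tames the first few iterations and does not affect the rates.
def mse_mu(sigma_star, T, reps=1000, seed=0):
    rng = np.random.default_rng(seed)
    mu = np.full(reps, 0.5)          # optimum is mu_* = 0, sigma_* = sigma_star
    sg = np.full(reps, sigma_star)
    for t in range(1, T + 1):
        x = rng.normal(0.0, sigma_star, size=reps)
        step = 1.0 / (t + 10)
        mu += step * (x - mu) / sg**2
        sg += step * ((x - mu)**2 / sg**3 - 1.0 / sg)
    return np.mean(mu**2)

for s in (1.0, 2.0):                 # sigma_*^2 = 1 (fast) vs sigma_*^2 = 4 (slow)
    ratio = mse_mu(s, 50_000) / mse_mu(s, 100_000)
    print(s, ratio)   # ~2 when sigma_*^2 = 1 (1/t rate); ~sqrt(2) when sigma_*^2 = 4
```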
