Accelerated Information Gradient Flow

Published in: Journal of Scientific Computing

Abstract

We present a framework for Nesterov’s accelerated gradient flows in probability space to design efficient mean-field Markov chain Monte Carlo algorithms for Bayesian inverse problems. Four examples of information metrics are considered, including the Fisher-Rao metric, the Wasserstein-2 metric, the Kalman-Wasserstein metric and the Stein metric. For both the Fisher-Rao and Wasserstein-2 metrics, we prove convergence properties of the accelerated gradient flows. For implementations, we propose a sampling-efficient discrete-time algorithm for the Wasserstein-2, Kalman-Wasserstein and Stein accelerated gradient flows with a restart technique. We also formulate a kernel bandwidth selection method, which learns the gradient of the logarithm of the density from Brownian-motion samples. Numerical experiments, including Bayesian logistic regression and a Bayesian neural network, show the strength of the proposed methods compared with state-of-the-art algorithms.

Notes

  1. https://archive.ics.uci.edu/ml/datasets.php.

References

  1. Amari, S.I.: Natural gradient works efficiently in learning. Neural Comput. 10(2), 251–276 (1998)

  2. Amari, S.I.: Information Geometry and its Applications, vol. 194. Springer, New York (2016)

  3. Amari, S., Barndorff-Nielsen, O.E., Kass, R.E., Lauritzen, S.L., Rao, C.R.: Differential Geometry in Statistical Inference. IMS Lecture Notes-Monograph Series, vol. 10. Institute of Mathematical Statistics, Hayward (1987)

  4. Bernton, E.: Langevin Monte Carlo and JKO splitting. In: Conference on Learning Theory, pp. 1777–1798 (2018)

  5. Carrillo, J., Choi, Y.P., Tse, O.: Convergence to equilibrium in Wasserstein distance for damped Euler equations with interaction forces. Commun. Math. Phys. 365(1), 329–361 (2019)

  6. Carrillo, J., Craig, K., Patacchini, F.S.: A blob method for diffusion. Calc. Var. Partial Differ. Equ. 58(2), 53 (2019)

  7. Cheng, X., Chatterji, N.S., Bartlett, P.L., Jordan, M.I.: Underdamped Langevin MCMC: a non-asymptotic analysis. arXiv preprint arXiv:1707.03663 (2017)

  8. Chow, S.N., Li, W., Zhou, H.: Wasserstein Hamiltonian flows. arXiv preprint arXiv:1903.01088 (2019)

  9. Deco, G., Obradovic, D.: An Information-Theoretic Approach to Neural Computing. Springer, New York (2012)

  10. Duncan, A., Nüsken, N., Szpruch, L.: On the geometry of Stein variational gradient descent. arXiv preprint arXiv:1912.00894 (2019)

  11. Garbuno-Inigo, A., Hoffmann, F., Li, W., Stuart, A.M.: Interacting Langevin diffusions: gradient structure and ensemble Kalman sampler. arXiv preprint arXiv:1903.08866 (2019)

  12. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

  13. Lafferty, J.D.: The density manifold and configuration space quantization. Trans. Am. Math. Soc. 305(2), 699–741 (1988)

  14. Liu, C., Zhuo, J., Cheng, P., Zhang, R., Zhu, J.: Understanding and accelerating particle-based variational inference. In: International Conference on Machine Learning, pp. 4082–4092 (2019)

  15. Liu, C., Zhuo, J., Cheng, P., Zhang, R., Zhu, J., Carin, L.: Accelerated first-order methods on the Wasserstein space for Bayesian inference. arXiv preprint arXiv:1807.01750 (2018)

  16. Liu, Q.: Stein variational gradient descent as gradient flow. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems 30, pp. 3115–3123. Curran Associates, Inc., US (2017)

  17. Liu, Q., Wang, D.: Stein variational gradient descent: a general purpose Bayesian inference algorithm. In: Advances in Neural Information Processing Systems, pp. 2378–2386 (2016)

  18. Liu, Y., Shang, F., Cheng, J., Cheng, H., Jiao, L.: Accelerated first-order methods for geodesically convex optimization on Riemannian manifolds. In: Advances in Neural Information Processing Systems, pp. 4868–4877 (2017)

  19. Ma, Y.A., Chatterji, N., Cheng, X., Flammarion, N., Bartlett, P., Jordan, M.I.: Is there an analog of Nesterov acceleration for MCMC? arXiv preprint arXiv:1902.00996 (2019)

  20. Maddison, C.J., Paulin, D., Teh, Y.W., O’Donoghue, B., Doucet, A.: Hamiltonian descent methods. arXiv preprint arXiv:1809.05042 (2018)

  21. Malagò, L., Matteucci, M., Pistone, G.: Natural gradient, fitness modelling and model selection: a unifying perspective. In: 2013 IEEE Congress on Evolutionary Computation, pp. 486–493. IEEE (2013)

  22. Malagò, L., Montrucchio, L., Pistone, G.: Wasserstein Riemannian geometry of positive definite matrices. arXiv preprint arXiv:1801.09269 (2018)

  23. Martens, J., Grosse, R.: Optimizing neural networks with Kronecker-factored approximate curvature. In: International Conference on Machine Learning, pp. 2408–2417 (2015)

  24. Modin, K.: Geometry of matrix decompositions seen through optimal transport and information geometry. arXiv preprint arXiv:1601.01875 (2016)

  25. Nesterov, Y.: A method of solving a convex programming problem with convergence rate \(O(1/k^2)\). Sov. Math. Dokl. 27(2), 372–376 (1983)

  26. Otto, F.: The geometry of dissipative evolution equations: the porous medium equation. Commun. Partial Differ. Equ. 26(1–2), 101–174 (2001)

  27. O’Donoghue, B., Candès, E.: Adaptive restart for accelerated gradient schemes. Found. Comput. Math. 15(3), 715–732 (2015)

  28. Principe, J.C., Xu, D., Fisher, J., Haykin, S.: Information theoretic learning. Unsupervised Adapt. Filter. 1, 265–319 (2000)

  29. Rezende, D.J., Mohamed, S.: Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770 (2015)

  30. Saha, A.: A Geometric Framework for Modeling and Inference using the Nonparametric Fisher–Rao metric. PhD thesis, The Ohio State University (2019)

  31. Singh, R.S.: Improvement on some known nonparametric uniformly consistent estimators of derivatives of a density. Ann. Stat. 394–399 (1977)

  32. Srivastava, A., Klassen, E.P.: Functional and Shape Data Analysis, vol. 475. Springer, New York (2016)

  33. Stuart, A.M.: Inverse problems: a Bayesian perspective. Acta Numer. 19, 451–559 (2010)

  34. Su, W., Boyd, S., Candès, E.J.: A differential equation for modeling Nesterov’s accelerated gradient method: theory and insights. J. Mach. Learn. Res. 27, 2510–2518 (2016)

  35. Taghvaei, A., Mehta, P.G.: Accelerated flow for probability distributions. arXiv preprint arXiv:1901.03317 (2019)

  36. Takatsu, A.: On Wasserstein geometry of the space of Gaussian measures. arXiv preprint arXiv:0801.2250 (2008)

  37. Villani, C.: Topics in Optimal Transportation. American Mathematical Society (2003)

  38. Wang, D., Tang, Z., Bajaj, C., Liu, Q.: Stein variational gradient descent with matrix-valued kernels. In: Advances in Neural Information Processing Systems, pp. 7834–7844 (2019)

  39. Wang, Y., Jia, Z., Wen, Z.: The search direction correction makes first-order methods faster. arXiv preprint arXiv:1905.06507 (2019)

  40. Wang, Y., Li, W.: Information Newton’s flow: second-order optimization method in probability space. arXiv preprint arXiv:2001.04341 (2020)

  41. Wibisono, A.: Proximal Langevin algorithm: rapid convergence under isoperimetry. arXiv preprint arXiv:1911.01469 (2019)

  42. Wibisono, A., Wilson, A.C., Jordan, M.I.: A variational perspective on accelerated methods in optimization. Proc. Natl. Acad. Sci. 113(47), E7351–E7358 (2016)

  43. Wilson, A.C., Recht, B., Jordan, M.I.: A Lyapunov analysis of momentum methods in optimization. arXiv preprint arXiv:1611.02635 (2016)

  44. Zhang, H., Sra, S.: Towards Riemannian accelerated gradient methods. arXiv preprint arXiv:1806.02812 (2018)

Author information

Correspondence to Wuchen Li.


Appendices

Appendix

In this appendix, we present detailed derivations of the examples and proofs of the propositions. We also design particle implementations of the KW-AIG and S-AIG flows and provide implementation details of the experiments.

Euler-Lagrange Equation, Hamiltonian Flows and AIG Flows

In this section, we review and derive the Euler-Lagrange equation, Hamiltonian flows and the Euler-Lagrange formulation of AIG flows in probability space.

1.1 Derivation of the Euler-Lagrange Equation

In this subsection, we derive the Euler-Lagrange equation in probability space. For a given metric \(g_\rho \) in probability space, we can define a Lagrangian by

$$\begin{aligned}{\mathcal {L}}(\rho _t,\partial _t\rho _t)=\frac{1}{2}g_{\rho _t}(\partial _t\rho _t,\partial _t\rho _t)-E(\rho _t).\end{aligned}$$

Proposition 5

The Euler-Lagrange equation for this Lagrangian follows

$$\begin{aligned} \partial _t\left( \frac{\delta {\mathcal {L}}}{\delta (\partial _t \rho _t)}\right) =\frac{\delta {\mathcal {L}}}{\delta \rho _t}+C(t), \end{aligned}$$

where C(t) is a spatially-constant function.

Proof

For a fixed \(T>0\) and two given densities \(\rho _0,\rho _T\), consider the variational problem

$$\begin{aligned} I(\rho _t)=\inf \limits _{\rho _t}\left\{ \left. \int _0^T{\mathcal {L}}(\rho _t,\partial _t\rho _t)dt\right| \rho _t|_{t=0}=\rho _0,\rho _t|_{t=T}=\rho _T\right\} . \end{aligned}$$

Let \(h_t\in {\mathcal {F}}(\varOmega )\) be the smooth perturbation function that satisfies \(\int h_t dx=0, t\in \left[ 0,T\right] \) and \(h_t|_{t=0}=h_t|_{t=T}\equiv 0\). Denote \(\rho _t^\epsilon = \rho _t+\epsilon h_t\). Note that we have the Taylor expansion

$$\begin{aligned} \begin{aligned} I(\rho _t^\epsilon )=&\int _0^T {\mathcal {L}}(\rho _t,\partial _t\rho _t)dt\\&+\epsilon \int _0^T\int \left( \frac{\delta {\mathcal {L}}}{\delta \rho _t}h_t+\frac{\delta {\mathcal {L}}}{\delta (\partial _t \rho _t)}\partial _th_t\right) dxdt+o(\epsilon ). \end{aligned} \end{aligned}$$

From \(\left. \frac{dI(\rho _t^\epsilon )}{d\epsilon }\right| _{\epsilon =0}=0\), it follows that

$$\begin{aligned} \int _0^T\int \left( \frac{\delta {\mathcal {L}}}{\delta \rho _t}h_t+\frac{\delta {\mathcal {L}}}{\delta (\partial _t \rho _t)}\partial _th_t\right) dxdt=0. \end{aligned}$$

Note that \(h_t|_{t=0}=h_t|_{t=T}\equiv 0\). Performing integration by parts w.r.t. t yields

$$\begin{aligned} \int _0^T\int \left( \frac{\delta {\mathcal {L}}}{\delta \rho _t}-\partial _t\frac{\delta {\mathcal {L}}}{\delta (\partial _t \rho _t)}\right) h_t dxdt=0. \end{aligned}$$

Because \(\int h_t dx=0\), the Euler-Lagrange equation holds with a spatially constant function C(t). \(\square \)

1.2 Derivation of Hamiltonian Flow

In this subsection, we derive the Hamiltonian flow in the probability space. Denote \(\varPhi _t = \delta {\mathcal {L}}/\delta (\partial _t \rho _t)=G(\rho _t)\partial _t\rho _t\). Then, the Euler-Lagrange equation can be formulated as a system in \((\rho _t,\varPhi _t)\), i.e.,

$$\begin{aligned} \left\{ \begin{aligned}&\partial _t\rho _t-G(\rho _t)^{-1}\varPhi _t=0,\\&\partial _t\varPhi _t+\frac{1}{2}\frac{\delta }{\delta \rho _t}\left( \int \varPhi _tG(\rho _t)^{-1}\varPhi _tdx\right) +\frac{\delta E}{\delta \rho _t}=0. \end{aligned} \right. \end{aligned}$$

First, we give a useful identity. Given a metric tensor \(G(\rho ):T_\rho {\mathcal {P}}(\varOmega )\rightarrow T_\rho ^*{\mathcal {P}}(\varOmega )\), we have

$$\begin{aligned} \begin{aligned} \int \sigma _1 G(\rho )\sigma _2 dx&= \int G(\rho )\sigma _1\sigma _2dx\\&=\int \varPhi _1G(\rho )^{-1}\varPhi _2 dx= \int G(\rho )^{-1}\varPhi _1\varPhi _2 dx. \end{aligned} \end{aligned}$$
(16)

Here \(\sigma _1 = G(\rho )^{-1}\varPhi _1\) and \(\sigma _2 = G(\rho )^{-1}\varPhi _2\), i.e., \(\varPhi _i=G(\rho )\sigma _i\). We then check that

$$\begin{aligned} \frac{\delta }{\delta \rho _t}\left( \int \partial _t\rho _t G(\rho _t) \partial _t\rho _tdx\right) =- \frac{\delta }{\delta \rho _t}\left( \int \varPhi _tG(\rho _t)^{-1}\varPhi _tdx\right) . \end{aligned}$$
(17)

Let \({{\tilde{\rho }}}_t = \rho _t+\epsilon h\), where \(h\in T_{\rho _t}{\mathcal {P}}(\varOmega )\). For all \(\sigma \in T_{\rho _t}{\mathcal {P}}\), it follows

$$\begin{aligned} G(\rho _t+\epsilon h )^{-1}G(\rho _t+\epsilon h ) \sigma = \sigma . \end{aligned}$$

The first-order derivative of the left-hand side w.r.t. \(\epsilon \) must be 0, i.e.,

$$\begin{aligned} \left( \frac{\partial G(\rho _t)^{-1}}{\partial \rho _t} \cdot h\right) G(\rho _t) \sigma +G(\rho _t) ^{-1}\left( \frac{\partial G(\rho _t)}{\partial \rho _t}\cdot h\right) \sigma = 0. \end{aligned}$$

Because \(\partial _t\rho _t=G(\rho _t)^{-1}\varPhi _t\), applying (16) yields

$$\begin{aligned} \begin{aligned} \int \partial _t \rho _t \left( \frac{\partial G(\rho _t)}{\partial \rho _t}\cdot h\right) \partial _t \rho _tdx&=\int \varPhi _t G(\rho _t)^{-1}\left( \frac{\partial G(\rho _t)}{\partial \rho _t}\cdot h\right) \partial _t \rho _tdx\\&= -\int \varPhi _t \left( \frac{\partial G(\rho _t)^{-1}}{\partial \rho _t}\cdot h\right) G(\rho _t)\partial _t \rho _tdx \\&=-\int \varPhi _t\left( \frac{\partial G(\rho _t)^{-1}}{\partial \rho _t}\cdot h\right) \varPhi _tdx. \end{aligned} \end{aligned}$$
(18)

Based on basic calculations, we can compute that

$$\begin{aligned} \int \partial _t\rho _t G({{\tilde{\rho }}}_t) \partial _t\rho _tdx-\int \partial _t\rho _t G(\rho _t)\partial _t\rho _tdx= & {} \epsilon \int \partial _t \rho _t \left( \frac{\partial G(\rho _t)}{\partial \rho _t}\cdot h\right) \partial _t \rho _tdx+o(\epsilon ), \end{aligned}$$
(19)
$$\begin{aligned}&-\int \varPhi _tG({{\tilde{\rho }}}_t)^{-1}\varPhi _tdx+\int \varPhi _tG(\rho _t)^{-1}\varPhi _tdx\nonumber \\= & {} -\epsilon \int \varPhi _t\left( \frac{\partial G(\rho _t)^{-1}}{\partial \rho _t}\cdot h\right) \varPhi _tdx+o(\epsilon ). \end{aligned}$$
(20)

Combining (18), (19) and (20) yields (17). Hence, the Euler-Lagrange equation is equivalent to

$$\begin{aligned} \begin{aligned}&\partial _t\varPhi _t=\frac{1}{2}\frac{\delta }{\delta \rho _t}\left( \int \partial _t\rho _t G(\rho _t) \partial _t\rho _tdx\right) -\frac{\delta E}{\delta \rho _t}=-\frac{1}{2}\frac{\delta }{\delta \rho _t}\left( \int \varPhi _tG(\rho _t)^{-1}\varPhi _tdx\right) -\frac{\delta E}{\delta \rho _t}. \end{aligned} \end{aligned}$$

This equation, combined with \(\partial _t\rho _t=G(\rho _t)^{-1}\varPhi _t\), recovers the Hamiltonian flow. In short, the Euler-Lagrange equation is formulated in the primal coordinates \((\rho _t,\partial _t\rho _t)\), while the Hamiltonian flow is formulated in the dual coordinates \((\rho _t,\varPhi _t)\). Similar interpretations can be found in [8].
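
The primal/dual correspondence above has a direct finite-dimensional analogue: \(q\) plays the role of \(\rho _t\), \(p\) the role of \(\varPhi _t\), and the metric tensor becomes a mass matrix \(M\). The sketch below (the diagonal \(M\) and the quadratic toy energy are illustrative assumptions) integrates \(\dot{q}=M^{-1}p\), \(\dot{p}=-\nabla E(q)\) with a leapfrog scheme and checks that the Hamiltonian is approximately conserved.

```python
import numpy as np

# Finite-dimensional analogue of the Hamiltonian flow (rho_t, Phi_t):
# H(q, p) = 0.5 * p^T M^{-1} p + E(q), with a constant "metric" M.
M = np.diag([1.0, 4.0])          # mass matrix (assumed constant)
Minv = np.linalg.inv(M)

def grad_E(q):                   # toy energy E(q) = 0.5 * ||q||^2
    return q

def hamiltonian(q, p):
    return 0.5 * p @ Minv @ p + 0.5 * q @ q

def leapfrog(q, p, dt, n_steps):
    """Symplectic integrator for q' = M^{-1} p, p' = -grad E(q)."""
    for _ in range(n_steps):
        p = p - 0.5 * dt * grad_E(q)
        q = q + dt * (Minv @ p)
        p = p - 0.5 * dt * grad_E(q)
    return q, p

q0, p0 = np.array([1.0, 0.5]), np.array([0.0, 1.0])
qT, pT = leapfrog(q0, p0, dt=0.01, n_steps=1000)
drift = abs(hamiltonian(qT, pT) - hamiltonian(q0, p0))
```

The small energy drift reflects the symplectic nature of leapfrog; an explicit Euler scheme would drift steadily.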

1.3 The Euler-Lagrange Formulation of AIG Flows

We can formulate the AIG flow as a second-order equation of \(\rho _t\),

$$\begin{aligned} \frac{D^2}{D t^2}\rho _t+\alpha _t \partial _t\rho _t+ G(\rho _t)^{-1}\frac{\delta E}{\delta \rho _t}=0. \end{aligned}$$

Here \(D^2/D t^2\) denotes the covariant derivative w.r.t. the metric \(G(\rho )\). We can also explicitly write \(\frac{D^2}{D t^2}\rho _t\) as

$$\begin{aligned} \begin{aligned} \frac{D^2}{D t^2}\rho _t =&\partial _{tt}\rho _t-(\partial _t G(\rho _t)^{-1})\partial _t\rho _t+\frac{1}{2}G(\rho _t)^{-1}\frac{\delta }{\delta \rho _t}\left( \int \partial _t\rho _t G(\rho _t)\partial _t\rho _t dx\right) . \end{aligned} \end{aligned}$$
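
In finite dimensions, this second-order equation reduces to the damped ODE \(\ddot{x}+\alpha _t\dot{x}+\nabla f(x)=0\) studied in [34]. The sketch below discretizes this ODE with a semi-implicit Euler scheme and a gradient-based restart in the spirit of [27]; the quadratic toy objective, the damping \(\alpha _t=3/t\) and the step size are assumptions.

```python
import numpy as np

A = np.diag([1.0, 100.0])            # ill-conditioned toy quadratic

def grad_f(x):                       # f(x) = 0.5 * x^T A x
    return A @ x

def aig_ode(x0, dt=1e-3, T=20.0):
    """Semi-implicit Euler for x'' + (3/t) x' + grad f(x) = 0, with a
    gradient restart: reset the momentum (and the time variable)
    whenever <v, grad f(x)> > 0, i.e. the momentum points uphill."""
    x, v, t = x0.copy(), np.zeros_like(x0), dt
    for _ in range(int(T / dt)):
        v = v - dt * ((3.0 / t) * v + grad_f(x))
        x = x + dt * v
        t += dt
        if v @ grad_f(x) > 0:        # restart condition
            v[:] = 0.0
            t = dt
    return x

x_final = aig_ode(np.array([1.0, 1.0]))
```

Without the restart the trajectory oscillates on ill-conditioned objectives; with it the objective decreases much more robustly.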

Derivation of Examples in Section 3

In this section, we present examples of gradient flows, Hamiltonian flows and derive particle dynamics examples in Sect. 3.

1.1 Examples of Gradient Flows

We first present several examples of gradient flows w.r.t. different metrics.

Example 12

(Fisher-Rao gradient flow) 

$$\begin{aligned} \begin{aligned} \partial _t\rho _t=&-G^F(\rho _t)^{-1}\frac{\delta E}{\delta \rho _t}=-\rho _t\left( \frac{\delta E}{\delta \rho _t}-\int \frac{\delta E}{\delta \rho _t}\rho _tdy\right) . \end{aligned} \end{aligned}$$
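
On a finite state space, with \(E(\rho )=\mathrm{KL}(\rho \Vert \pi )\) so that \(\frac{\delta E}{\delta \rho }=\log (\rho /\pi )+1\), the Fisher-Rao gradient flow becomes a replicator-type ODE whose stationary point is \(\pi \). A minimal forward-Euler sketch (the three-state toy target is an assumption):

```python
import numpy as np

# Fisher-Rao gradient flow of E(rho) = KL(rho || pi) on a finite state space:
#   dp_i/dt = -p_i * (log(p_i/pi_i) - sum_j p_j log(p_j/pi_j)),
# whose stationary point is p = pi.
pi = np.array([0.5, 0.3, 0.2])           # toy target distribution
p = np.array([0.1, 0.1, 0.8])            # initial distribution

dt = 0.01
for _ in range(5000):
    phi = np.log(p / pi)                 # delta E / delta rho, up to a constant
    p = p - dt * p * (phi - p @ phi)
    p = np.clip(p, 1e-12, None); p /= p.sum()   # guard against round-off
```

Note that the flow conserves total mass exactly; the final normalization only guards against floating-point round-off.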

Example 13

(Wasserstein gradient flow) 

$$\begin{aligned} \begin{aligned} \partial _t\rho _t =&-G^W(\rho _t)^{-1}\frac{\delta E}{\delta \rho _t}=\nabla \cdot \left( \rho _t\nabla \frac{\delta E}{\delta \rho _t}\right) . \end{aligned} \end{aligned}$$

Example 14

(Kalman-Wasserstein gradient flow) 

$$\begin{aligned} \begin{aligned} \partial _t\rho _t =&-G^{KW}(\rho _t)^{-1}\frac{\delta E}{\delta \rho _t}=\nabla \cdot \left( \rho _t C^\lambda (\rho _t)\nabla \left( \frac{\delta E}{\delta \rho _t}\right) \right) . \end{aligned} \end{aligned}$$

Example 15

(Stein gradient flow) 

$$\begin{aligned} \begin{aligned} \partial _t\rho _t =&-G^S(\rho _t)^{-1}\frac{\delta E}{\delta \rho _t}=\nabla _x\cdot \left( \rho _t(x) \int k(x,y) \rho _t(y) \nabla _y\left( \frac{\delta E}{\delta \rho _t}\right) dy\right) . \end{aligned} \end{aligned}$$
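
Taking \(E(\rho )=\mathrm{KL}(\rho \Vert \pi )\) in the Stein gradient flow and discretizing with particles recovers the SVGD update of [17]. A minimal sketch, where the RBF kernel with fixed bandwidth and the 1D standard-normal target are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=0.5, size=50)   # particles, initialized off-target

def grad_log_target(x):                       # target pi = N(0, 1)
    return -x

h = 1.0                                       # kernel bandwidth (assumed fixed)

for _ in range(2000):
    D = x[:, None] - x[None, :]               # D[j, i] = x_j - x_i
    K = np.exp(-D**2 / (2 * h))               # K[j, i] = k(x_j, x_i)
    gradK = -D / h * K                        # d/dx_j k(x_j, x_i)
    # SVGD: x_i += eps/N * sum_j [k(x_j,x_i) grad log pi(x_j) + d_{x_j} k(x_j,x_i)]
    x = x + 0.05 * (K.T @ grad_log_target(x) + gradK.sum(axis=0)) / len(x)
```

The first term transports particles toward high-density regions; the second term is the repulsion that keeps them spread out.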

1.2 Examples of Hamiltonian Flows

We next present several examples of Hamiltonian flows w.r.t. different metrics. The derivations simply follow from the definition of the given information metric and the formulations given in Appendix A.2.

Example 16

(Fisher-Rao Hamiltonian flow) The Fisher-Rao Hamiltonian flow follows

$$\begin{aligned} \left\{ \begin{aligned}&\partial _t\rho _t-\rho _t\left( \varPhi _t-{\mathbb {E}}_{\rho _t}[\varPhi _t]\right) =0,\\&\partial _t\varPhi _t+\frac{1}{2}\varPhi _t^2-{\mathbb {E}}_{\rho _t}[\varPhi _t]\varPhi _t+\frac{\delta E}{\delta \rho _t}=0, \end{aligned} \right. \end{aligned}$$

where the corresponding Hamiltonian is

$$\begin{aligned} {\mathcal {H}}^F(\rho _t,\varPhi _t)=\frac{1}{2}\left( {\mathbb {E}}_{\rho _t}[\varPhi _t^2]-\left( {\mathbb {E}}_{\rho _t}[\varPhi _t]\right) ^2\right) +E(\rho _t). \end{aligned}$$

The derivation comes from that

$$\begin{aligned} \begin{aligned} \frac{\delta }{\delta \rho _t} \int \varPhi _t G^F(\rho _t)^{-1} \varPhi _t dx&= \frac{\delta }{\delta \rho _t}\left( {\mathbb {E}}_{\rho _t}[\varPhi _t^2]-\left( {\mathbb {E}}_{\rho _t}[\varPhi _t]\right) ^2\right) \\&=\varPhi _t^2-2{\mathbb {E}}_{\rho _t}[\varPhi _t]\varPhi _t. \end{aligned} \end{aligned}$$

Example 17

(Wasserstein Hamiltonian flow) The Wasserstein Hamiltonian flow writes

$$\begin{aligned} \left\{ \begin{aligned}&\partial _t\rho _t+\nabla \cdot (\rho _t\nabla \varPhi _t)=0,\\&\partial _t\varPhi _t+\frac{1}{2}\Vert \nabla \varPhi _t\Vert ^2+\frac{\delta E}{\delta \rho _t}=0, \end{aligned} \right. \end{aligned}$$

where the corresponding Hamiltonian is

$$\begin{aligned} {\mathcal {H}}^W(\rho _t,\varPhi _t)=\frac{1}{2}\int \Vert \nabla \varPhi _t\Vert ^2\rho _tdx+E(\rho _t). \end{aligned}$$

It is identical to the Wasserstein Hamiltonian flow introduced in [8]. The derivation simply comes from that

$$\begin{aligned} \begin{aligned} \frac{\delta }{\delta \rho _t} \int \varPhi _t G^W(\rho _t)^{-1} \varPhi _t dx= \frac{\delta }{\delta \rho _t}\left( \int \Vert \nabla \varPhi _t\Vert _2^2\rho _t dx\right) =\Vert \nabla \varPhi _t\Vert ^2. \end{aligned} \end{aligned}$$

Example 18

(Kalman-Wasserstein Hamiltonian flow) The Kalman-Wasserstein Hamiltonian flow writes

$$\begin{aligned} \left\{ \begin{aligned}&\partial _t\rho _t+\nabla \cdot (\rho _tC^\lambda (\rho _t)\nabla \varPhi _t)=0,\\&\partial _t \varPhi _t+ \frac{1}{2}\left( (x-m(\rho _t))^TB_{\rho _t}(\varPhi _t)(x-m(\rho _t))+\nabla \varPhi _t(x)^TC^\lambda (\rho _t)\nabla \varPhi _t(x)\right) +\frac{\delta E}{\delta \rho _t}=0,\\ \end{aligned}\right. \end{aligned}$$

where the corresponding Hamiltonian is

$$\begin{aligned} {\mathcal {H}}^{KW}(\rho _t,\varPhi _t)=\frac{1}{2}\int \nabla \varPhi _t^T C^\lambda (\rho _t)\nabla \varPhi _t \rho _t dx+E(\rho _t). \end{aligned}$$

The derivation comes from that

$$\begin{aligned} \begin{aligned} \frac{\delta }{\delta \rho _t} \int \varPhi _t G^{KW}(\rho _t)^{-1} \varPhi _t dx&=\frac{\delta }{\delta \rho _t}\left( \int \nabla \varPhi _t^T C^\lambda (\rho _t)\nabla \varPhi _t \rho _t dx\right) \\&=(x-m(\rho _t))^TB_{\rho _t}(\varPhi _t)(x-m(\rho _t)) \\&\quad +\nabla \varPhi _t(x)^TC^\lambda (\rho _t)\nabla \varPhi _t(x). \end{aligned} \end{aligned}$$

Here we recall that \(B_{\rho _t}(\varPhi _t)=\int \nabla \varPhi _t\nabla \varPhi _t^T\rho _t dx\).

Example 19

(Stein Hamiltonian flow) The Stein Hamiltonian flow writes

$$\begin{aligned} \left\{ \begin{aligned}&\partial _t\rho _t(x) = -\nabla _x\cdot \left( \rho _t(x) \int k(x,y) \rho _t(y) \nabla _y\varPhi _t(y) dy\right) ,\\&\partial _t\varPhi _t(x) = -\int \nabla \varPhi _t(x)^T\nabla \varPhi _t(y) k(x,y) \rho _t(y)dy -\frac{\delta E}{\delta \rho _t}(x), \end{aligned}\right. \end{aligned}$$

where the corresponding Hamiltonian is

$$\begin{aligned} {\mathcal {H}}(\rho _t,\varPhi _t)=\frac{1}{2}\int \int \nabla \varPhi _t(x)^T\nabla \varPhi _t(y) k(x,y) \rho _t(x) \rho _t(y)dxdy +E(\rho _t). \end{aligned}$$

The derivation comes from that

$$\begin{aligned} \begin{aligned} \frac{\delta }{\delta \rho _t} \int \varPhi _t G^{S}(\rho _t)^{-1} \varPhi _t dx&=\frac{\delta }{\delta \rho _t}\left( \int \int \nabla \varPhi _t(x)^T\nabla \varPhi _t(y) k(x,y) \rho _t(x) \rho _t(y)dxdy\right) \\&=2\int \nabla \varPhi _t(x)^T\nabla \varPhi _t(y) k(x,y) \rho _t(y)dy. \end{aligned} \end{aligned}$$

1.3 The Derivation of Example 9 (Wasserstein Metric) in Section 3

We start with an identity. For a twice differentiable \(\varPhi (x)\), we have

$$\begin{aligned} \frac{1}{2}\nabla \Vert \nabla \varPhi \Vert ^2 = \nabla ^2\varPhi \nabla \varPhi = (\nabla \varPhi \cdot \nabla )\nabla \varPhi . \end{aligned}$$
(21)

From (W-AIG), it follows that

$$\begin{aligned} \partial _t\rho _t+\nabla \cdot (\rho _t\nabla \varPhi _t) = 0. \end{aligned}$$
(22)

This is the continuity equation of \(\rho _t\). Hence, at the particle level, \(X_t\) follows

$$\begin{aligned} dX_t = \nabla \varPhi _t(X_t)dt. \end{aligned}$$

Let \(V_t=\nabla \varPhi _t(X_t)\). Then, by the material derivative in fluid dynamics and (W-AIG), we have

$$\begin{aligned} \begin{aligned} \frac{dV_t}{dt} =&\frac{d}{dt} \nabla \varPhi _t(X_t)= (\partial _t+\nabla \varPhi _t(X_t)\cdot \nabla )\nabla \varPhi _t(X_t)\\ =&-\alpha _t\nabla \varPhi _t(X_t)-\frac{1}{2}\nabla \Vert \nabla \varPhi _t\Vert ^2-\nabla \frac{\delta E}{\delta \rho _t}+(\nabla \varPhi _t\cdot \nabla )\nabla \varPhi _t\\ =&-\alpha _t\nabla \varPhi _t(X_t)-\nabla \frac{\delta E}{\delta \rho _t}(X_t)=-\alpha _tV_t-\nabla \frac{\delta E}{\delta \rho _t}(X_t). \end{aligned} \end{aligned}$$
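
The identity (21), which drives the cancellation in the last step, can be verified numerically with central finite differences; the smooth test potential below is an arbitrary choice.

```python
import numpy as np

# Check identity (21): 0.5 * grad ||grad Phi||^2 = Hess(Phi) grad Phi,
# for the toy potential Phi(x) = sin(x1) + x1 * x2^2.
def grad_phi(x):
    x1, x2 = x
    return np.array([np.cos(x1) + x2**2, 2 * x1 * x2])

def hess_phi(x):
    x1, x2 = x
    return np.array([[-np.sin(x1), 2 * x2],
                     [2 * x2, 2 * x1]])

x = np.array([0.3, -0.7])
eps = 1e-6
lhs = np.zeros(2)            # central differences of 0.5 * ||grad Phi||^2
for i in range(2):
    e = np.zeros(2); e[i] = eps
    lhs[i] = (np.sum(grad_phi(x + e)**2) - np.sum(grad_phi(x - e)**2)) / (4 * eps)
rhs = hess_phi(x) @ grad_phi(x)
```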

1.4 The Derivations of Examples 7 and 10 (Kalman-Wasserstein Metric) in Section 3

We first derive the Hamiltonian flow under the Kalman-Wasserstein metric. To begin, we show that

$$\begin{aligned} \frac{\delta }{\delta \rho } \left\{ \int \varPhi G^{KW}(\rho )^{-1}\varPhi dx\right\} =(x-m(\rho ))^T B_\rho (\varPhi ) (x-m(\rho ))+\nabla \varPhi (x)^TC^\lambda (\rho )\nabla \varPhi (x). \end{aligned}$$
(23)

From the definition of Kalman-Wasserstein metric, we have

$$\begin{aligned} \begin{aligned} \int \varPhi G^{KW}(\rho )^{-1}\varPhi dx&= \int \nabla \varPhi ^T C^\lambda (\rho )\nabla \varPhi \rho dx \\&=\left\langle C^\lambda (\rho ), \int \nabla \varPhi ^T \nabla \varPhi \rho dx\right\rangle \\&=\left\langle C^\lambda (\rho ), B_\rho (\varPhi )\right\rangle . \end{aligned} \end{aligned}$$

Let \({{\hat{\rho }}}=\rho +\epsilon h\), where \(h\in T_\rho {\mathcal {P}}(\varOmega )\). Then, we can compute that

$$\begin{aligned} \begin{aligned} \left\langle C^\lambda (\rho +\epsilon h), B_{\rho +\epsilon h}(\varPhi )\right\rangle -\left\langle C^\lambda (\rho ), B_\rho (\varPhi )\right\rangle&=\left\langle C^\lambda (\rho +\epsilon h)-C^\lambda (\rho ), B_{\rho }(\varPhi )\right\rangle \\&\quad +\left\langle C^\lambda (\rho ), B_{\rho +\epsilon h}(\varPhi )-B_\rho (\varPhi )\right\rangle . \end{aligned} \end{aligned}$$

We note that

$$\begin{aligned} \begin{aligned} C^\lambda (\rho +\epsilon h)-C^\lambda (\rho )&=\epsilon \int m(h) (x-m(\rho ))^T \rho dx \\&\quad +\epsilon \int (x-m(\rho ))m(h)^T \rho dx\\&\quad +\epsilon \int (x-m(\rho ))(x-m(\rho ))^T h dx +O(\epsilon ^2)\\&=\epsilon \int (x-m(\rho ))(x-m(\rho ))^T h dx +O(\epsilon ^2). \end{aligned} \end{aligned}$$
$$\begin{aligned} B_{\rho +\epsilon h}(\varPhi )-B_\rho (\varPhi ) = \epsilon \int h \nabla \varPhi \nabla \varPhi ^T dx. \end{aligned}$$

Hence, we can derive

$$\begin{aligned} \begin{aligned} \left\langle C^\lambda (\rho +\epsilon h), B_{\rho +\epsilon h}(\varPhi )\right\rangle -\left\langle C^\lambda (\rho ), B_\rho (\varPhi )\right\rangle&=\epsilon \int h \left\langle \nabla \varPhi \nabla \varPhi ^T, C^\lambda (\rho )\right\rangle dx \\&\quad +\epsilon \int h \left\langle (x-m(\rho ))(x-m(\rho ))^T,B_\rho (\varPhi )\right\rangle dx \\&\quad +O(\epsilon ^2). \end{aligned} \end{aligned}$$

This proves (23). Hence, the Hamiltonian flow under the Kalman-Wasserstein metric follows

$$\begin{aligned} \left\{ \begin{aligned}&\partial _t\rho _t+\nabla \cdot (\rho _tC^\lambda (\rho _t)\nabla \varPhi _t)=0,\\&\partial _t \varPhi _t+ \frac{1}{2}\left( (x-m(\rho _t))^TB_{\rho _t}(\varPhi _t)(x-m(\rho _t))+\nabla \varPhi _t(x)^TC^\lambda (\rho _t)\nabla \varPhi _t(x)\right) +\frac{\delta E}{\delta \rho _t}=0.\\ \end{aligned}\right. \end{aligned}$$
(24)

Adding a linear damping term \(\alpha _t\varPhi _t\) to the second equation in (24) yields Example 7.

For Example 10, suppose that \(X_t\) follows \(\rho _t\) and \(V_t = \nabla \varPhi _t(X_t)\). Then, we have

$$\begin{aligned} \frac{d}{dt}X_t = C^\lambda (\rho _t) V_t. \end{aligned}$$

Since \(V_t = \nabla \varPhi _t(X_t)\), we can establish that

$$\begin{aligned} \begin{aligned} \frac{d}{dt} V_t&= (\partial _t+C^\lambda (\rho _t)\nabla \varPhi _t\cdot \nabla )\nabla \varPhi _t(X_t)\\&=\nabla \partial _t\varPhi _t(X_t)+\nabla ^2\varPhi _t(X_t)C^\lambda (\rho _t)\nabla \varPhi _t(X_t). \end{aligned} \end{aligned}$$

The last equality can be established as follows. For \(i=1,\dots ,d\), we have

$$\begin{aligned} \begin{aligned} \left( C^\lambda (\rho _t)\nabla \varPhi _t\cdot \nabla \right) \nabla _i\varPhi _t(X_t)&=\sum _{j=1}^d\left( C^\lambda (\rho _t)\nabla \varPhi _t\right) _j\nabla _j\nabla _i\varPhi _t(X_t)\\&=\sum _{j=1}^d\nabla _{ij}\varPhi _t(X_t)\left( C^\lambda (\rho _t)\nabla \varPhi _t\right) _j \\&= \left( \nabla ^2\varPhi _tC^\lambda (\rho _t)\nabla \varPhi _t\right) _i. \end{aligned} \end{aligned}$$

According to the chain rule, we also have

$$\begin{aligned} \nabla \left( \nabla \varPhi _t(x)^TC^\lambda (\rho _t)\nabla \varPhi _t(x)\right) = 2\nabla ^2\varPhi _t(x)C^\lambda (\rho _t)\nabla \varPhi _t(x). \end{aligned}$$

As a result, we can establish that

$$\begin{aligned} \begin{aligned} \frac{d}{dt}V_t =&-\alpha _t V_t-B_{\rho _t}(\varPhi _t)(X_t-m(\rho _t))-\nabla \frac{\delta E}{\delta \rho _t}(X_t) \\ =&-\alpha _t V_t-{\mathbb {E}}[V_t V_t^T](X_t-{\mathbb {E}}[X_t])-\nabla \frac{\delta E}{\delta \rho _t}(X_t). \end{aligned} \end{aligned}$$
(25)

In summary, the KW-AIG flow in the particle formulation takes the form (5).
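
A minimal ensemble sketch of this particle formulation, taking \(E(\rho )=\int f\,d\rho \) so that \(\nabla \frac{\delta E}{\delta \rho }=\nabla f\); the quadratic toy \(f\), the constant damping, and the choice \(C^\lambda (\rho )\approx \mathrm{Cov}(X)+\lambda I\) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, lam, alpha, dt = 100, 2, 1.0, 2.0, 0.01
X = rng.normal(3.0, 1.0, size=(N, d))    # particles
V = np.zeros((N, d))                     # momenta V_i ~ grad Phi_t(X_i)

def grad_f(x):                           # toy f(x) = 0.5 * ||x||^2
    return x

for _ in range(5000):
    Xc = X - X.mean(axis=0)
    C = Xc.T @ Xc / N + lam * np.eye(d)  # C^lambda(rho_t), regularized covariance
    B = V.T @ V / N                      # B_rho(Phi) = E[V V^T]
    # dV_i = -alpha V_i - B (X_i - mean) - grad f(X_i)
    V = V + dt * (-alpha * V - Xc @ B.T - grad_f(X))
    X = X + dt * (V @ C.T)               # dX_i = C^lambda V_i
```

The regularization \(\lambda I\) prevents the dynamics from stalling once the ensemble collapses.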

1.5 The Derivations of Examples 8 and 11 (Stein Metric) in Section 3

For an objective function \(E(\rho )\), the Hamiltonian follows

$$\begin{aligned} {\mathcal {H}}(\rho ,\varPhi )=\frac{1}{2}\int \int \nabla \varPhi (x)^T\nabla \varPhi (y) k(x,y) \rho (x) \rho (y)dxdy +E(\rho ). \end{aligned}$$

We note that

$$\begin{aligned} \begin{aligned}&\frac{\delta }{\delta \rho }\left[ \frac{1}{2}\int \int \nabla \varPhi (x)^T\nabla \varPhi (y) k(x,y) \rho (x) \rho (y)dxdy\right] (x) \\&\quad =\int \nabla \varPhi (x)^T\nabla \varPhi (y) k(x,y) \rho (y)dy. \end{aligned} \end{aligned}$$

Hence, the Hamiltonian flow writes

$$\begin{aligned} \left\{ \begin{aligned}&\partial _t\rho _t(x) = -\nabla _x\cdot \left( \rho _t(x) \int k(x,y) \rho _t(y) \nabla _y\varPhi _t(y) dy\right) ,\\&\partial _t\varPhi _t(x) = - \int \nabla \varPhi _t(x)^T\nabla \varPhi _t(y) k(x,y) \rho _t(y)dy -\frac{\delta E}{\delta \rho _t}(x). \end{aligned}\right. \end{aligned}$$
(26)

Adding a linear damping term \(\alpha _t\varPhi _t\) to the second equation in (26) yields Example 8.

For Example 11, similarly, suppose that \(X_t\) follows \(\rho _t\) and \(V_t = \nabla \varPhi _t(X_t)\). Then, we have

$$\begin{aligned} \frac{d}{dt} X_t = \int k(X_t,y) \nabla \varPhi _t(y)\rho _t(y)dy. \end{aligned}$$

We note that

$$\begin{aligned} \begin{aligned} \nabla \left( \int \nabla \varPhi (x)^T\nabla \varPhi (y) k(x,y) \rho (y)dy\right)&=\nabla ^2 \varPhi (x) \int \nabla \varPhi (y) k(x,y) \rho (y)dy \\&\quad +\int \nabla \varPhi (x)^T \nabla \varPhi (y) \nabla _x k(x,y) \rho (y)dy. \end{aligned} \end{aligned}$$

Hence, we have

$$\begin{aligned} \begin{aligned} \frac{d}{dt} V_t&= \partial _t\nabla \varPhi _t(X_t) +\nabla ^2\varPhi _t (X_t)\left( \int k(X_t,y) \rho _t(y) \nabla _y\varPhi _t(y) dy\right) \\&=-\alpha _t \nabla \varPhi _t(X_t)-\int \nabla \varPhi _t(X_t)^T \nabla \varPhi _t (y) \nabla _x k(X_t,y) \rho _t(y)dy-\nabla \left( \frac{\delta E}{\delta \rho _t}\right) (X_t)\\&=-\alpha _tV_t-\int V_t^T\nabla \varPhi _t (y) \nabla _x k(X_t,y) \rho _t(y)dy-\nabla \left( \frac{\delta E}{\delta \rho _t}\right) (X_t). \end{aligned} \end{aligned}$$

This derives Example 11.
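
A particle sketch of these S-AIG dynamics, again taking \(E(\rho )=\int f\,d\rho \) so that \(\nabla \frac{\delta E}{\delta \rho }=\nabla f\); the RBF kernel, the quadratic toy \(f\) and the constant damping are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
N, h, alpha, dt = 50, 4.0, 2.0, 0.01
x = rng.normal(4.0, 0.5, size=N)         # 1D particles
v = np.zeros(N)                          # momenta v_i ~ grad Phi_t(x_i)

def grad_f(x):                           # toy f(x) = 0.5 * x^2
    return x

for _ in range(3000):
    D = x[:, None] - x[None, :]          # D[i, j] = x_i - x_j
    K = np.exp(-D**2 / (2 * h))          # K[i, j] = k(x_i, x_j)
    gradK = -D / h * K                   # d/dx_i k(x_i, x_j)
    # dv_i = -alpha v_i - (1/N) sum_j (v_i v_j) d_x k(x_i, x_j) - grad f(x_i)
    v = v + dt * (-alpha * v - v * (gradK @ v) / N - grad_f(x))
    x = x + dt * (K @ v) / N             # dx_i = (1/N) sum_j k(x_i, x_j) v_j
```

Note that the particle velocities are kernel-averaged, so well-separated particles barely interact; the bandwidth must be large enough to keep the ensemble coupled.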

Wasserstein Metric in Gaussian Families

In this section, we first introduce the Wasserstein metric, gradient flows and Hamiltonian flows in Gaussian families. Then, we validate the existence of solutions to (W-AIG) in Gaussian families. Denote by \({\mathcal {N}}_n^0\) the family of multivariate Gaussian densities with zero mean. Namely, for \(\rho _0, \rho ^*\in {\mathcal {N}}_n^0\), we show that (W-AIG) has a solution \((\rho _t,\varPhi _t)\) with \(\rho _t\in {\mathcal {N}}_n^0\).

Let \({\mathbb {P}}^n\) and \({\mathbb {S}}^n\) denote the sets of symmetric positive definite matrices and symmetric matrices of size \(n\times n\), respectively. Each \(\rho \in {\mathcal {N}}_n^0\) is uniquely determined by its covariance matrix \(\varSigma \in {\mathbb {P}}^n\). The Wasserstein metric \(G^W(\rho )\) on \({\mathcal {P}}({\mathbb {R}}^n)\) induces the Wasserstein metric \(G^W(\varSigma )\) on \({\mathbb {P}}^n\), which is also known as the Bures metric, see [22, 24, 36]. For \(\varSigma \in {\mathbb {P}}^n\), the tangent and cotangent spaces satisfy \(T_{\varSigma }{\mathbb {P}}^n\simeq T^*_{\varSigma }{\mathbb {P}}^n\simeq {\mathbb {S}}^n\).

Definition 3

(Wasserstein metric in Gaussian families) For \(\varSigma \in {\mathbb {P}}^n\), the metric tensor \(G^W(\varSigma ):{\mathbb {S}}^n\rightarrow {\mathbb {S}}^n\) is defined by

$$\begin{aligned} G^W(\varSigma )^{-1}S = 2(\varSigma S+S\varSigma ). \end{aligned}$$

The Wasserstein metric on \({\mathbb {P}}^n\) follows

$$\begin{aligned} g_{\varSigma }^W(A_1,A_2) = {\text {tr}}(A_1G^W(\varSigma )A_2)= 4{\text {tr}}(S_1\varSigma S_2), \end{aligned}$$

where \(S_i\in {\mathbb {S}}^n\) is the solution to

$$\begin{aligned} A_i = 2(\varSigma S_i+S_i\varSigma ),\quad i=1,2. \end{aligned}$$
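
The equation \(A_i=2(\varSigma S_i+S_i\varSigma )\) is a Lyapunov-type equation; in the eigenbasis of \(\varSigma \) it decouples entrywise, which yields a direct solver. A sketch with a randomly generated SPD \(\varSigma \) (an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.normal(size=(3, 3))
Sigma = M @ M.T + np.eye(3)               # a generic SPD matrix
A = rng.normal(size=(3, 3)); A = A + A.T  # a symmetric tangent direction

# In the eigenbasis of Sigma, A = 2 (Sigma S + S Sigma) reads
# A'_ij = 2 (l_i + l_j) S'_ij, so S is recovered entrywise.
lam, U = np.linalg.eigh(Sigma)
S = U @ ((U.T @ A @ U) / (2 * (lam[:, None] + lam[None, :]))) @ U.T
residual = np.linalg.norm(2 * (Sigma @ S + S @ Sigma) - A)
```

Since \(\varSigma \in {\mathbb {P}}^n\), all \(\lambda _i+\lambda _j>0\), so the division is always well defined.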

1.1 Gradient Flows and Hamiltonian Flows in Gaussian Families

We derive the Wasserstein gradient flow and the Wasserstein Hamiltonian flow in Gaussian families as follows.

Proposition 6

The Wasserstein gradient flow in Gaussian families writes

$$\begin{aligned} {{\dot{\varSigma }}}_t = -2(\varSigma _t\nabla _{\varSigma _t} E(\varSigma _t)+\nabla _{\varSigma _t} E(\varSigma _t)\varSigma _t). \end{aligned}$$

Here \(\nabla _{\varSigma _t}\) is the standard matrix derivative.

The Wasserstein Hamiltonian flow satisfies

$$\begin{aligned} \left\{ \begin{aligned}&{{\dot{\varSigma }}}_t-2(S_t\varSigma _t+\varSigma _t S_t)=0,\\&\dot{S}_t+2S_t^2+\nabla _{\varSigma _t} E(\varSigma _t)=0, \end{aligned}\right. \end{aligned}$$
(27)

where \(S_t\in {\mathbb {S}}^n\). The corresponding Hamiltonian satisfies

$$\begin{aligned}{\mathcal {H}}^W(\varSigma _t,S_t)=2{\text {tr}}(S_t\varSigma _tS_t)+E(\varSigma _t).\end{aligned}$$

The derivation of the gradient flow follows directly from the definition of the Wasserstein metric in Gaussian families.
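As an illustrative sketch (ours, not from the paper), one can integrate the gradient flow of Proposition 6 by forward Euler for the KL energy \(E(\varSigma )=\frac{1}{2}[{\text {tr}}(\varSigma W^*)-n-\log \det (\varSigma W^*)]\) that appears later in this appendix, whose matrix derivative is \(\nabla _{\varSigma } E=\frac{1}{2}(W^*-\varSigma ^{-1})\); the step size and the initial covariance below are arbitrary choices:

```python
import numpy as np

def kl_energy(Sigma, Wstar):
    """E(Sigma) = (1/2) [tr(Sigma W*) - n - log det(Sigma W*)]."""
    n = Sigma.shape[0]
    return 0.5 * (np.trace(Sigma @ Wstar) - n
                  - np.log(np.linalg.det(Sigma @ Wstar)))

def gradient_flow(Sigma0, Wstar, h=1e-3, steps=5000):
    """Forward-Euler steps of dSigma/dt = -2(Sigma gradE + gradE Sigma)."""
    Sigma = Sigma0.copy()
    for _ in range(steps):
        gradE = 0.5 * (Wstar - np.linalg.inv(Sigma))
        Sigma = Sigma - 2.0 * h * (Sigma @ gradE + gradE @ Sigma)
    return Sigma
```

With \(W^*=I\) the flow reduces to \({{\dot{\varSigma }}}_t=2(I-\varSigma _t)\), so \(\varSigma _t\) converges to \(I\) exponentially fast and the energy decays to zero.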

We then derive the Hamiltonian flow as follows. For \(A\in {\mathbb {S}}^n\), we define the linear operator \(M_A:{\mathbb {S}}^n\rightarrow {\mathbb {S}}^n\) by

$$\begin{aligned} M_AB = AB+BA,\quad B\in {\mathbb {S}}^n. \end{aligned}$$

It is easy to verify that if \(A\in {\mathbb {P}}^n\), then \(M_A^{-1}\) is well-defined. For a flow \(\varSigma _t\in {\mathbb {P}}^n, t\ge 0\), we define the Lagrangian \(L(\varSigma _t,{\dot{\varSigma }}_t)=\frac{1}{2}g_{\varSigma _t}({{\dot{\varSigma }}}_t,{\dot{\varSigma }}_t)-E(\varSigma _t).\) The corresponding Euler-Lagrange equation reads

$$\begin{aligned} \frac{d }{d t}\frac{d L}{d{{\dot{\varSigma }}}_t}=\frac{d L}{d \varSigma }. \end{aligned}$$
(28)

Let \(S_t=\frac{1}{2}M_{\varSigma _t}^{-1}{{\dot{\varSigma }}}_t\), i.e., \({\dot{\varSigma }}_t=2(S_t\varSigma _t+\varSigma _t S_t)\). Then, it follows

$$\begin{aligned} \begin{aligned} g_{\varSigma _t}({{\dot{\varSigma }}}_t,{{\dot{\varSigma }}}_t)&=\,4{\text {tr}}(S_t\varSigma _tS_t)=2{\text {tr}}((S_t\varSigma _t+\varSigma _t S_t)S_t)\\&=\,{\text {tr}}({{\dot{\varSigma }}}_tS_t)=\frac{1}{2}{\text {tr}}\left( {{\dot{\varSigma }}}_tM_{\varSigma _t}^{-1}{{\dot{\varSigma }}}_t\right) . \end{aligned} \end{aligned}$$

This leads to \(\frac{d L}{d {\dot{\varSigma }}_t}=\frac{1}{2}M_{\varSigma _t}^{-1}{{\dot{\varSigma }}}_t=S_t.\) For simplicity, we denote \(g=g_{\varSigma _t}({{\dot{\varSigma }}}_t,{{\dot{\varSigma }}}_t)\). First, we show that

$$\begin{aligned} \frac{d g}{d \varSigma _t}=-4S_t^2. \end{aligned}$$

Recall that \(S_t=\frac{1}{2}M_{\varSigma _t}^{-1}{{\dot{\varSigma }}}_t\). Given \({{\dot{\varSigma }}}_t\), \(S_t\) can be viewed as a continuous function of \(\varSigma _t\). For any \(A\in {\mathbb {S}}^n\), define \(l_A={\text {tr}}((\varSigma _t S_t+S_t \varSigma _t)A)\); since \(\varSigma _t S_t+S_t \varSigma _t=\frac{1}{2}{{\dot{\varSigma }}}_t\) is fixed, \(l_A\) does not depend on \(\varSigma _t\). Hence,

$$\begin{aligned} \begin{aligned} 0=&\frac{d l_A}{d\varSigma _t}=\frac{\partial S_t}{\partial \varSigma _t} \frac{\partial l_A}{\partial S_t}+\frac{\partial l_A}{\partial \varSigma _t}\\ =&\frac{\partial S_t}{\partial \varSigma _t}(A\varSigma _t+\varSigma _t A)+(AS_t+S_tA). \end{aligned} \end{aligned}$$

Here we view \(\partial S_t/\partial \varSigma _t\) as a linear operator on \({\mathbb {S}}^n\). Let \(B=A\varSigma _t+\varSigma _t A\); then \(A=M_{\varSigma _t}^{-1}B\), and \(\frac{\partial S_t}{\partial \varSigma _t} B+M_{S_t}M_{\varSigma _t}^{-1}B=0\) holds for all \(B\in {\mathbb {S}}^n\). Therefore, we have \(\frac{\partial S_t}{\partial \varSigma _t}=-M_{S_t}M_{\varSigma _t}^{-1}\). Hence,

$$\begin{aligned} \begin{aligned} \frac{d g}{d\varSigma _t}=&\frac{\partial S_t}{\partial \varSigma _t} \frac{\partial g}{\partial S_t}+\frac{\partial g}{\partial \varSigma _t} \\ =&-4M_{S_t}M_{\varSigma _t}^{-1}(S_t\varSigma _t+\varSigma _tS_t)+4S_t^2\\ =&-4M_{S_t}S_t+4S_t^2=-4S_t^2. \end{aligned} \end{aligned}$$

As a result, the Euler-Lagrange equation (28) is equivalent to

$$\begin{aligned} \dot{S}_t =\frac{d}{dt}\frac{d L}{d{{\dot{\varSigma }}}_t}=\frac{d L}{d\varSigma _t}=-2S_t^2-\nabla E(\varSigma _t). \end{aligned}$$
(29)

Combining (29) with \({{\dot{\varSigma }}}_t=2(S_t\varSigma _t+\varSigma _t S_t)\) renders the Hamiltonian flow in Gaussian families.

1.2 Proof of Proposition 2

By adding a damping term \(\alpha _tS_t\), we derive (W-AIG-G), i.e., the Wasserstein AIG flow in Gaussian families. We present the proof of Proposition 2 as follows. We first show that \(\varSigma _t\) stays in \({\mathbb {P}}^n\). Suppose that \(\varSigma _t\in {\mathbb {P}}^n\) for \(0\le t\le T\). Define \(H_t=H(\varSigma _t,S_t)=2{\text {tr}}(S_t\varSigma _t S_t)+E(\varSigma _t)\). We observe that (W-AIG-G) is equivalent to

$$\begin{aligned} {{\dot{\varSigma }}}_t = \frac{\partial H_t}{\partial S_t},\quad \dot{S}_t = -\alpha _t S_t-\frac{\partial H_t}{\partial \varSigma _t}. \end{aligned}$$
(30)

We show that \(H_t\) is non-increasing with respect to t:

$$\begin{aligned} \begin{aligned} \frac{dH_t}{dt}=&{\text {tr}}\left( \frac{\partial H_t}{\partial S_t} \dot{S}_t+\frac{\partial H_t}{\partial \varSigma _t} {{\dot{\varSigma }}}_t\right) \\ =&{\text {tr}}\left( \frac{\partial H_t}{\partial S_t}\left( -\alpha _tS_t-\frac{\partial H_t}{\partial \varSigma _t}\right) +\frac{\partial H_t}{\partial \varSigma _t}\frac{\partial H_t}{\partial S_t}\right) \\ =&-\alpha _t{\text {tr}}\left( S_t\frac{\partial H_t}{\partial S_t}\right) =-2\alpha _t{\text {tr}}(S_t(\varSigma _tS_t+S_t\varSigma _t))\\ =&-4\alpha _t {\text {tr}}(S_t\varSigma _tS_t)\le 0. \end{aligned} \end{aligned}$$

For simplicity, we denote \(W^*=(\varSigma ^*)^{-1}\). Let \(\lambda _t\) be the smallest eigenvalue of \(\varSigma _t\). Then, \(\log \det (\varSigma _tW^*)=\log \det W^*+\log \det (\varSigma _t)\ge \log \det W^*+n\log \lambda _t.\) Therefore,

$$\begin{aligned} \begin{aligned} -\frac{n}{2}(\log \lambda _t+1)-\frac{1}{2}\log \det W^*&\le -\frac{1}{2}\left[ \log \det (\varSigma _tW^*)+n\right] \\&\le E(\varSigma _t)\le H(t)\le H(0), \end{aligned} \end{aligned}$$

which yields that

$$\begin{aligned} \lambda _t\ge \exp \left( -\frac{2}{n}H(0)-1-\frac{1}{n}\log \det W^*\right) . \end{aligned}$$
(31)

This means that as long as \(\varSigma _t\in {\mathbb {P}}^n\), the smallest eigenvalue of \(\varSigma _t\) has a positive lower bound. Suppose that there exists \(T>0\) such that \(\varSigma _T\notin {\mathbb {P}}^n\). Because \(\varSigma _t\) is continuous with respect to t, there exists \(T_1<T\) such that \(\varSigma _t\in {\mathbb {P}}^n\) for \(0\le t\le T_1\) and \(\lambda _{T_1}<\exp \left( -\frac{2}{n}H(0)-1-\frac{1}{n}\log \det W^*\right) \), which contradicts (31).
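The monotonicity of \(H_t\) and the positivity of \(\varSigma _t\) can also be observed numerically. Below is a rough forward-Euler sketch of (30) with the KL energy and a constant damping \(\alpha _t\equiv \alpha \) (all concrete values are our own illustrative choices, not from the paper):

```python
import numpy as np

def damped_hamiltonian_flow(Sigma0, Wstar, alpha=3.0, h=1e-3, steps=3000):
    """Euler steps for (30): dSigma = 2(S Sigma + Sigma S), dS = -alpha S - 2 S^2 - gradE."""
    n = Sigma0.shape[0]
    Sigma, S = Sigma0.copy(), np.zeros((n, n))

    def hamiltonian(Sigma, S):
        E = 0.5 * (np.trace(Sigma @ Wstar) - n
                   - np.log(np.linalg.det(Sigma @ Wstar)))
        return 2.0 * np.trace(S @ Sigma @ S) + E

    H0 = hamiltonian(Sigma, S)
    for _ in range(steps):
        gradE = 0.5 * (Wstar - np.linalg.inv(Sigma))
        Sigma, S = (Sigma + 2.0 * h * (S @ Sigma + Sigma @ S),
                    S + h * (-alpha * S - 2.0 * S @ S - gradE))
    return H0, hamiltonian(Sigma, S), Sigma
```

In the runs we tried, \(H_t\) decays towards zero while the eigenvalues of \(\varSigma _t\) stay bounded away from zero, consistent with (31).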

We then reveal the relationship between (W-AIG) in \({\mathcal {P}}({\mathbb {R}}^n)\) and \({\mathbb {P}}^n\). We observe that

$$\begin{aligned} \begin{aligned}&\frac{\partial }{\partial t}\det (\varSigma _t)=\det (\varSigma _t){\text {tr}}(\varSigma _t^{-1}{{\dot{\varSigma }}}_t),\\&\frac{\partial }{\partial t} \varSigma _t^{-1}=-\varSigma _t^{-1}{{\dot{\varSigma }}}_t\varSigma _t^{-1}. \end{aligned} \end{aligned}$$

Combining with \({{\dot{\varSigma }}}_t=2(\varSigma _tS_t+S_t\varSigma _t)\), we obtain

$$\begin{aligned} \begin{aligned}&{\text {tr}}\left( \varSigma _t^{-1}{{\dot{\varSigma }}}_t\right) =2{\text {tr}}(S_t+\varSigma _t^{-1}S_t\varSigma _t)=4{\text {tr}}(S_t),\\&{\text {tr}}\left( x^T\varSigma _t^{-1}{{\dot{\varSigma }}}_t \varSigma _t^{-1}x\right) =2{\text {tr}}(x^T\varSigma _t^{-1}S_tx+x^TS_t\varSigma _t^{-1}x)=4{\text {tr}}(S_t\varSigma _t^{-1}xx^T). \end{aligned} \end{aligned}$$

Therefore, it follows

$$\begin{aligned} \begin{aligned} \partial _t \rho _t(x) =&\frac{\partial }{\partial t}\left( \frac{1}{{\sqrt{\det (\varSigma _t)}}}\right) \sqrt{\det (\varSigma _t)}\rho _t(x) +\frac{1}{2}{\text {tr}}(x^T\varSigma _t^{-1}{{\dot{\varSigma }}}_t\varSigma _t^{-1}x)\rho _t(x)\\ =&-\frac{1}{2}{\text {tr}}(\varSigma _t^{-1}{{\dot{\varSigma }}}_t)\rho _t(x)+2{\text {tr}}(S_t\varSigma _t^{-1}xx^T)\rho _t(x)\\ =&-2{\text {tr}}(S_t(I-\varSigma _t^{-1}xx^T))\rho _t(x).\\ \end{aligned} \end{aligned}$$

Note that \(\nabla \varPhi _t(x) = 2S_tx\). Hence, we have

$$\begin{aligned} \begin{aligned} -\nabla \cdot (\rho _t\nabla \varPhi _t)&= -2\sum _{i=1}^n \partial _i (\rho _t(x) S_tx)_i\\&=-2\sum _{i=1}^n\left[ \rho _t(x) \partial _i(S_tx)_i+(S_tx)_i\partial _i \rho _t(x)\right] \\&=-2\rho _t(x) \left[ {\text {tr}}(S_t)+(S_tx)^T(-\varSigma _t^{-1}x)\right] \\&=-2\rho _t(x){\text {tr}}(S_t(I-\varSigma _t^{-1}xx^T))=\partial _t\rho _t(x). \end{aligned} \end{aligned}$$

The first equation of (W-AIG) holds. Because \(\partial _t\varPhi _t(x)=x^T\dot{S}_tx+\dot{C}(t)\), we have

$$\begin{aligned} \begin{aligned} \partial _t\varPhi _t(x)+\alpha _t\varPhi _t(x)+\frac{1}{2}\Vert \nabla \varPhi _t(x)\Vert ^2&=x^T\dot{S}_tx+\alpha _tx^TS_tx+2x^TS_t^2x+\dot{C}(t)\\&=-x^T\nabla _{\varSigma _t} E(\varSigma _t)x+\dot{C}(t)\\&=\frac{1}{2}x^T(\varSigma _t^{-1}-W^*)x+\dot{C}(t). \end{aligned} \end{aligned}$$

Note that \(\rho ^*\) is the Gaussian density with the covariance matrix \(\varSigma ^*\). Because \(\dot{C}(t) = \frac{1}{2}\log \det (\varSigma _tW^*)-1\), we can compute

$$\begin{aligned} \begin{aligned} \frac{\delta E}{\delta \rho _t}=&\log \rho _t(x)-\log \rho ^*(x)+1\\ =&-\frac{1}{2}x^T(\varSigma _t^{-1}-W^*)x-\frac{1}{2}\log \det (\varSigma _tW^*)+1\\ =&-\frac{1}{2}x^T(\varSigma _t^{-1}-W^*)x-\dot{C}(t) \\ =&-(\partial _t\varPhi _t(x)+\alpha _t\varPhi _t(x)+\frac{1}{2}\Vert \nabla \varPhi _t(x)\Vert ^2). \end{aligned} \end{aligned}$$

Therefore, the second equation of (W-AIG) holds. Because \(\varSigma _t|_{t=0}=\varSigma _0\), \(S_t|_{t=0}=0\) and \(C(0)=0\), we have \(\rho _t|_{t=0}=\rho _0\) and \(\varPhi _t|_{t=0}=0\). This completes the proof.

Proof of Convergence Rate Under Wasserstein Metric

In this section, we briefly review the Riemannian structure of the probability space and present proofs of the propositions in Sect. 4 under the Wasserstein metric.

1.1 A Brief Review on the Geometric Properties of the Probability Space

Suppose that we have a metric \(g_\rho \) on the probability space \({\mathcal {P}}(\varOmega )\). Given two probability densities \(\rho _0,\rho _1\in {\mathcal {P}}(\varOmega )\), we define the distance as follows:

$$\begin{aligned} \begin{aligned}&{\mathcal {D}}(\rho _0,\rho _1)^2 =\inf _{{{\hat{\rho }}}_s}\left\{ \int _0^1g_{\hat{\rho }_s}(\partial _s{{\hat{\rho }}}_s,\partial _s{{\hat{\rho }}}_s)ds:\hat{\rho }_s|_{s=0}=\rho _0,{{\hat{\rho }}}_s|_{s=1}=\rho _1\right\} . \end{aligned} \end{aligned}$$

The minimizer \({{\hat{\rho }}}_s\) of the above problem is defined as the geodesic curve connecting \(\rho _0\) and \(\rho _1\). An exponential map at \(\rho _0\in {\mathcal {P}}(\varOmega )\) is a mapping from the tangent space \(T_{\rho _0}{\mathcal {P}}(\varOmega )\) to \({\mathcal {P}}(\varOmega )\). Namely, \(\sigma \in T_{\rho _0}{\mathcal {P}}(\varOmega )\) is mapped to a point \(\rho _1\in {\mathcal {P}}(\varOmega )\) such that there exists a geodesic curve \({{\hat{\rho }}}_s\) satisfying \({{\hat{\rho }}}_s|_{s=0}=\rho _0,\partial _s\hat{\rho }_s|_{s=0}=\sigma ,\) and \({{\hat{\rho }}}_s|_{s=1}=\rho _1\).
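As a concrete example (ours, not from the paper), for one-dimensional zero-mean Gaussians under the Wasserstein metric the geodesic is the displacement interpolation \({{\hat{\rho }}}_s=((1-s){\text {Id}}+sT)\#\rho _0\) with \(T(x)=(\sigma _1/\sigma _0)x\), so \({{\hat{\rho }}}_s\) is Gaussian with standard deviation \((1-s)\sigma _0+s\sigma _1\). A small sampling sketch:

```python
import numpy as np

def geodesic_std(sigma0, sigma1, s, n_samples=200_000, seed=0):
    """Empirical std of the displacement interpolation between N(0, sigma0^2)
    and N(0, sigma1^2) at time s, pushing samples through (1-s)Id + sT."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, sigma0, size=n_samples)
    pushed = (1.0 - s) * x + s * (sigma1 / sigma0) * x
    return pushed.std()
```

The empirical standard deviation of the pushed samples matches the closed form \((1-s)\sigma _0+s\sigma _1\) up to Monte Carlo error, illustrating that the geodesic interpolates the standard deviations linearly.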

1.2 The Inverse of Exponential Map

In this subsection, we characterize the inverse of exponential map in the probability space with the Wasserstein metric.

Proposition 7

Denote the geodesic curve \(\gamma (s)\) that connects \(\rho _t\) and \(\rho ^*\) by \(\gamma (s)=(sT_t+(1-s){\text {Id}})\#\rho _t,\,s\in [0,1]\). Here \({\text {Id}}\) is the identity mapping from \({\mathbb {R}}^n\) to itself. Then, \(\partial _s\gamma (s)|_{s=0}\) corresponds to a tangent vector \(-\nabla \cdot (\rho _t(x) (T_t(x)-x))\in T_{\rho _t}{\mathcal {P}}(\varOmega )\).

For simplicity, we denote \(T_t^{s}=(sT_t+(1-s){\text {Id}})^{-1},s\in \left[ 0,1\right] \). Based on the theory of optimal transport [37], the density along the geodesic curve \(\gamma (s)\) admits the explicit formula

$$\begin{aligned} \gamma (s) = (sT_t+(1-s){\text {Id}})\#\rho _t=\det (\nabla T_t^s)\,\rho _t\circ T_t^s. \end{aligned}$$

Through basic calculations, we can compute that

$$\begin{aligned}&\left. \frac{d}{ds}T_t^s\right| _{s=0}=-\left. \frac{d}{ds}(sT_t+(1-s){\text {Id}})\right| _{s=0}={\text {Id}}-T_t, \\&\left. \frac{d}{ds}\det (\nabla T_t^s)\right| _{s=0}=\left. \frac{d}{ds}\det (I+s(I-\nabla T_t)+o(s))\right| _{s=0}={\text {tr}}(I-\nabla T_t). \end{aligned}$$

Therefore, we have

$$\begin{aligned} \begin{aligned} \left. \partial _s \gamma (s)\right| _{s=0}(x)&={\text {tr}}(I-\nabla T_t)\rho _t(x)+\left\langle \nabla \rho _t(x),x-T_t(x)\right\rangle \\&=\nabla \cdot (x-T_t(x))\rho _t(x)+\left\langle \nabla \rho _t(x),x-T_t(x)\right\rangle \\&=-\nabla \cdot (\rho _t(x)(T_t(x)-x)), \end{aligned} \end{aligned}$$

which completes the proof.
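Proposition 7 can be sanity-checked by finite differences in one dimension. Take \(\rho _t={\mathcal {N}}(0,1)\) and \(T_t(x)=2x\) (so \(\rho ^*={\mathcal {N}}(0,4)\) and \(\gamma (s)={\mathcal {N}}(0,(1+s)^2)\)); then \(-\partial _x(\rho _t(x)(T_t(x)-x))=\rho _t(x)(x^2-1)\). A sketch with our own example values:

```python
import numpy as np

def gaussian_pdf(x, sigma):
    return np.exp(-x**2 / (2.0 * sigma**2)) / (np.sqrt(2.0 * np.pi) * sigma)

def tangent_vectors(x, eps=1e-5):
    """Finite-difference d/ds gamma(s)|_{s=0} vs. the closed form rho(x)(x^2 - 1)."""
    fd = (gaussian_pdf(x, 1.0 + eps) - gaussian_pdf(x, 1.0 - eps)) / (2.0 * eps)
    closed = gaussian_pdf(x, 1.0) * (x**2 - 1.0)
    return fd, closed
```

The two quantities agree up to the finite-difference error, confirming the tangent-vector formula in this simple case.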

1.3 The Proof of Propositions 4 and 5

The main goal of this subsection is to prove that the Lyapunov function \({\mathcal {E}}(t)\) is non-increasing.

Preparations We first give a better characterization of the optimal transport plan \(T_t\). We can write \(T_t=\nabla \varPsi _t\), where \(\varPsi _t\) is a strictly convex function, see [37]. This indicates that \(\nabla T_t\) is symmetric. We then introduce the following proposition.

Proposition 8

Suppose that \(E(\rho )\) satisfies Hess(\(\beta \)) for \(\beta \ge 0\), and let \(T_t(x)\) be the optimal transport plan from \(\rho _t\) to \(\rho ^*\). Then

$$\begin{aligned} \begin{aligned} E(\rho ^*)\ge&E(\rho _t)+\int \left\langle T_t(x)-x,\nabla \frac{\delta E}{\delta \rho _t}\right\rangle \rho _t dx+\frac{\beta }{2} \int \Vert T_t(x)-x\Vert ^2\rho _t dx. \end{aligned} \end{aligned}$$

This is a direct result of \(\beta \)-displacement convexity of \(E(\rho )\) based on Proposition 7.

Lemma 2

Denote \(u_t=\partial _t (T_t)^{-1}\circ T_t\). Then, \(u_t\) satisfies

$$\begin{aligned} \nabla \cdot \left( \rho _t(u_t-\nabla \varPhi _t)\right) =0. \end{aligned}$$
(32)

We also have

$$\begin{aligned} \partial _tT_t(x)=-\nabla T_t(x)u_t(x). \end{aligned}$$
(33)

Proof

Because \((T_t)^{-1}\#{\rho ^*}={\rho _t}\), let \(u_t=\partial _t(T_t)^{-1}\circ T_t\) and \(X_t=(T_t)^{-1}X_0\), where \(X_0\sim \rho ^*\). This yields \(\frac{d}{dt}X_t=u_t(X_t)\), and the distribution of \(X_t\) is \(\rho _t\). Hence, \(\rho _t\) satisfies the continuity equation

$$\begin{aligned}\partial _t\rho _t +\nabla \cdot (\rho _t u_t)=0.\end{aligned}$$

Combining this with the continuity equation (22) yields (32).

Then, we express \(\partial _tT_t(x)\) in terms of \(u_t\). By the Taylor expansion,

$$\begin{aligned} T_{t+s}(x)=T_t(x)+s\partial _t T_t(x)+o(s). \end{aligned}$$

Let \(y=(T_t)^{-1}(x)\). It follows that

$$\begin{aligned} \begin{aligned} (T_{t+s})^{-1}(x)=&(T_t)^{-1}(x)+su_t((T_t)^{-1}(x))+o(s)=y+su_t(y)+o(s). \end{aligned} \end{aligned}$$

Therefore, we have

$$\begin{aligned} \begin{aligned} 0&=T_{t+s}((T_{t+s})^{-1}(x))-x\\&=T_{t+s}(y+su_t (y)+o(s))-x\\&=T_t(y+su_t (y))+s\partial _t T_t(y+su_t (y))-x+o(s)\\&=T_t(y)+s\nabla T_t(y) u_t(y)+s\partial _t T_t(y)-x+o(s)\\&=s\left[ \nabla T_t(y) u_t(y)+\partial _t T_t(y)\right] +o(s). \end{aligned} \end{aligned}$$

Hence, \(\nabla T_t(y) u_t(y)+\partial _t T_t(y)=0\). Replacing y by x yields (33). \(\square \)
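Identity (33) can be checked by hand in the one-dimensional Gaussian family: with \(\sigma _t\) a smooth path of standard deviations and \(T_t(x)=(\sigma ^*/\sigma _t)x\), one gets \(u_t(x)=({\dot{\sigma }}_t/\sigma _t)x\) and \(\partial _tT_t=-\nabla T_t\,u_t\). A finite-difference sketch (the path \(\sigma _t=1+t\) and the evaluation point are made-up examples):

```python
sigma_star = 2.0
sigma = lambda t: 1.0 + t            # hypothetical path of standard deviations
t0, x, eps = 0.5, 1.7, 1e-6

# u_t = (d/dt (T_t)^{-1}) composed with T_t; here (T_t)^{-1}(y) = (sigma_t / sigma*) y
sigma_dot = (sigma(t0 + eps) - sigma(t0 - eps)) / (2.0 * eps)
u = sigma_dot / sigma(t0) * x
# partial_t T_t(x) by finite differences, and the (scalar) gradient of T_t
dT_dt = (sigma_star / sigma(t0 + eps) - sigma_star / sigma(t0 - eps)) / (2.0 * eps) * x
gradT = sigma_star / sigma(t0)
# (33): partial_t T_t + gradT * u_t should vanish
residual = dT_dt + gradT * u
```

Here \(dT\_dt=-\sigma ^*{\dot{\sigma }}_t\sigma _t^{-2}x\) and \(gradT\cdot u=\sigma ^*{\dot{\sigma }}_t\sigma _t^{-2}x\), so the residual vanishes up to finite-difference error.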

The following lemma illustrates two important properties of \(u_t\) and \(\partial _tT_t\).

Lemma 3

For \(u_t\) satisfying (32), we have

$$\begin{aligned} \begin{aligned}&\int \left\langle \nabla \varPhi _t-u_t, \nabla T_t\nabla \varPhi _t\right\rangle \rho _tdx\ge 0,\\&\int \left\langle \nabla \varPhi _t-u_t, \nabla T_t(x)(T_t(x)-x) \right\rangle \rho _t=0. \end{aligned} \end{aligned}$$

Proof

We first notice that \(u_t-\nabla \varPhi _t\) is divergence-free with respect to \(\rho _t\). From \(-\nabla T_t u_t = \partial _tT_t = \nabla \partial _t\varPsi _t\), we observe that \(-\nabla T_t u_t\) is the gradient of \(\partial _t\varPsi _t\). Therefore,

$$\begin{aligned} \begin{aligned}&\int \left\langle \nabla \varPhi _t-u_t, \nabla T_tu_t \right\rangle \rho _t dx= \int \partial _t \varPsi _t\, \nabla \cdot (\rho _t(\nabla \varPhi _t-u_t))\,dx=0. \end{aligned} \end{aligned}$$

Based on our previous characterization of the optimal transport plan \(T_t\), \(\nabla T_t = \nabla ^2\varPsi _t\) is symmetric positive definite. This yields that

$$\begin{aligned} \begin{aligned} \int \left\langle \nabla \varPhi _t-u_t, \nabla T_t\nabla \varPhi _t\right\rangle \rho _tdx&=\int \left\langle \nabla \varPhi _t-u_t, \nabla T_t\nabla \varPhi _t\right\rangle \rho _tdx-\int \left\langle \nabla \varPhi _t-u_t, \nabla T_tu_t \right\rangle \rho _t\\&=\int \left\langle \nabla \varPhi _t-u_t,\nabla T_t(\nabla \varPhi _t-u_t)\right\rangle \rho _tdx\ge 0. \end{aligned} \end{aligned}$$

The last inequality utilizes that \(\nabla T_t\) is positive definite and \(\rho _t\) is non-negative. Then, we prove the equality in Lemma 3. Note that \(\nabla T_t(x)(T_t(x)-x) = \frac{1}{2}\nabla \left( \Vert T_t(x)-x\Vert ^2+2\varPsi _t(x)-\Vert x\Vert ^2\right) \) is a gradient. Similarly, it follows that

$$\begin{aligned} \int \left\langle \nabla \varPhi _t-u_t, \nabla T_t(x)(T_t(x)-x) \right\rangle \rho _t=0. \end{aligned}$$

\(\square \)
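The equalities in Lemma 3 rest on the fact that pairing a \(\rho _t\)-divergence-free field with a gradient field integrates to zero. A small grid sketch of this mechanism (the stream function and the map \(T(x)=2x\) are our own illustrative choices): we model \(\rho _t(u_t-\nabla \varPhi _t)\) directly as a two-dimensional curl, which is divergence-free by construction, and pair it with \(\nabla T(T(x)-x)=\nabla \Vert x\Vert ^2\).

```python
import numpy as np

h = 0.05
g = np.arange(-6.0, 6.0 + h, h)
X, Y = np.meshgrid(g, g, indexing='ij')   # axis 0 is x, axis 1 is y

psi = np.exp(-(X**2 + Y**2))              # rapidly decaying stream function
# rho*(u - grad Phi) modeled as the curl (-psi_y, psi_x): divergence-free
field_x = -np.gradient(psi, h, axis=1)
field_y = np.gradient(psi, h, axis=0)
# with T(x) = 2x, grad T (T(x) - x) = 2x = grad ||x||^2
pairing = np.sum(field_x * 2.0 * X + field_y * 2.0 * Y) * h * h
```

The grid sum approximates \(\int \langle \rho (u-\nabla \varPhi ), \nabla \Vert x\Vert ^2\rangle dx\) and vanishes up to discretization error, in line with the lemma.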

Lemma 3 and the relationship (33) give

$$\begin{aligned} -\int \left\langle \partial _tT_t, \nabla \varPhi _t\right\rangle \rho _t dx= & {} \int \left\langle u_t, \nabla T_t\nabla \varPhi _t\right\rangle \rho _tdx\le \int \left\langle \nabla \varPhi _t, \nabla T_t\nabla \varPhi _t\right\rangle \rho _tdx, \end{aligned}$$
(34)
$$\begin{aligned} \int \left\langle \partial _tT_t, T_t(x)-x\right\rangle \rho _t dx= & {} -\int \left\langle \nabla \varPhi _t, \nabla T_t(x)(T_t(x)-x)\right\rangle \rho _tdx. \end{aligned}$$
(35)

Proof of Proposition 4

Based on the definition of the Wasserstein metric, we have

$$\begin{aligned} \partial _t E(\rho _t) = -\int \frac{\delta E}{\delta \rho _t}\nabla \cdot (\rho _t\nabla \varPhi _t)dx. \end{aligned}$$

Differentiating \({\mathcal {E}}(t)\) w.r.t. t renders

$$\begin{aligned} {{\dot{{\mathcal {E}}}}}(t)e^{-\sqrt{\beta }t}&=\beta \int \left\langle \partial _t T_t, T_t(x)-x\right\rangle \rho _tdx-\frac{\beta }{2}\int \Vert T_t(x)-x\Vert ^2\nabla \cdot (\rho _t\nabla \varPhi _t) dx\\&\quad -\sqrt{\beta } \int \left\langle \partial _tT_t,\nabla \varPhi _t\right\rangle \rho _tdx-\sqrt{\beta }\int \left\langle T_t(x)-x,\partial _t\nabla \varPhi _t\right\rangle \rho _tdx\\&\quad +\sqrt{\beta }\int \left\langle T_t(x)-x,\nabla \varPhi _t\right\rangle \nabla \cdot (\rho _t\nabla \varPhi _t) dx+\int \left\langle \nabla \varPhi _t,\partial _t\nabla \varPhi _t\right\rangle \rho _tdx\\&\quad -\frac{1}{2}\int \Vert \nabla \varPhi _t\Vert ^2\nabla \cdot (\rho _t\nabla \varPhi _t)dx-\int \frac{\delta E}{\delta \rho _t}\nabla \cdot (\rho _t\nabla \varPhi _t)dx\\&\quad +\frac{\sqrt{\beta }}{2}\int \Vert \nabla \varPhi _t\Vert ^2\rho _tdx-\beta \int \left\langle T_t(x)-x,\nabla \varPhi _t(x)\right\rangle \rho _tdx\\&\quad +\frac{\sqrt{\beta ^3}}{2} \int \Vert T_t(x)-x\Vert ^2\rho _tdx+\sqrt{\beta }(E(\rho _t)-E(\rho ^*)). \end{aligned}$$
(36)

For the last line of (36), Proposition 8 renders

$$\begin{aligned} \begin{aligned} \frac{\sqrt{\beta ^3}}{2} \int \Vert T_t(x)-x\Vert ^2\rho _tdx+\sqrt{\beta }E(\rho _t) \le -\sqrt{\beta } \int \left\langle T_t(x)-x,\nabla \frac{\delta E}{\delta \rho _t}\right\rangle \rho _t dx. \end{aligned} \end{aligned}$$
(37)

We first compute the terms with the coefficient \(\beta ^0\) in \({{\dot{{\mathcal {E}}}}}(t)e^{-\sqrt{\beta }t}\). We observe that

$$\begin{aligned} \begin{aligned}&\int \left\langle \nabla \varPhi _t,\partial _t\nabla \varPhi _t\right\rangle \rho _tdx -\frac{1}{2}\int \Vert \nabla \varPhi _t\Vert ^2\nabla \cdot (\rho _t\nabla \varPhi _t)dx\\&\qquad -\int \frac{\delta E}{\delta \rho _t}\nabla \cdot (\rho _t\nabla \varPhi _t) dx\\&\quad =\int \left\langle \partial _t\nabla \varPhi _t+\frac{1}{2}\nabla \Vert \nabla \varPhi _t\Vert ^2+\nabla \frac{\delta E}{\delta \rho _t},\nabla \varPhi _t\right\rangle \rho _tdx\\&\quad =-2\sqrt{\beta }\int \Vert \nabla \varPhi _t\Vert ^2\rho _tdx, \end{aligned} \end{aligned}$$
(38)

where the last equality uses (W-AIG) with \(\alpha _t=2\sqrt{\beta }\). Substituting (37) and (38) into the expression of \({{\dot{{\mathcal {E}}}}}(t)e^{-\sqrt{\beta }t}\) yields

$$\begin{aligned} \begin{aligned} {{\dot{{\mathcal {E}}}}}(t)e^{-\sqrt{\beta }t}&\le \beta \int \left\langle \partial _t T_t, T_t(x)-x\right\rangle \rho _tdx-\frac{\beta }{2}\int \Vert T_t(x)-x\Vert ^2\nabla \cdot (\rho _t\nabla \varPhi _t) dx\\&\quad -\beta \int \left\langle T_t(x)-x,\nabla \varPhi _t\right\rangle \rho _tdx-\sqrt{\beta } \int \left\langle \partial _tT_t,\nabla \varPhi _t\right\rangle \rho _tdx\\&\quad -\sqrt{\beta }\int \left\langle T_t(x)-x,\partial _t\nabla \varPhi _t\right\rangle \rho _tdx-\sqrt{\beta } \int \left\langle T_t(x)-x,\nabla \frac{\delta E}{\delta \rho _t}\right\rangle \rho _t dx\\&\quad +\sqrt{\beta }\int \left\langle T_t(x)-x,\nabla \varPhi _t\right\rangle \nabla \cdot (\rho _t\nabla \varPhi _t) dx-\frac{3\sqrt{\beta }}{2}\int \Vert \nabla \varPhi _t\Vert ^2\rho _tdx. \end{aligned} \end{aligned}$$
(39)

Then, we deal with the terms with \(\nabla \cdot (\rho _t\nabla \varPhi _t)\). We have the following two identities

$$\begin{aligned}&\int \left\langle T_t(x)-x,\nabla \varPhi _t\right\rangle \nabla \cdot (\rho _t\nabla \varPhi _t) dx\\&\quad =-\int \left\langle \nabla \left\langle T_t(x)-x,\nabla \varPhi _t\right\rangle ,\nabla \varPhi _t\right\rangle \rho _t dx \\&\quad =-\int \left\langle \nabla \varPhi _t,\nabla ^2\varPhi _t(x) (T_t(x)-x)+(\nabla T_t-I)\nabla \varPhi _t\right\rangle \rho _tdx \\&\quad =-\frac{1}{2}\int \left\langle T_t(x)-x,\nabla \Vert \nabla \varPhi _t\Vert ^2\right\rangle \rho _t dx-\int \left\langle \nabla \varPhi _t, \nabla T_t\nabla \varPhi _t\right\rangle \rho _t dx+\int \Vert \nabla \varPhi _t\Vert ^2\rho _t dx. \end{aligned}$$
(40)
$$\begin{aligned}&-\frac{1}{2}\int \Vert T_t(x)-x\Vert ^2\nabla \cdot (\rho _t\nabla \varPhi _t) dx\\&\quad =\int \left\langle (\nabla T_t(x)-I)(T_t(x)-x), \nabla \varPhi _t\right\rangle \rho _tdx\\&\quad =\int \left\langle T_t(x)-x,\nabla T_t\nabla \varPhi _t\right\rangle \rho _tdx-\int \left\langle T_t(x)-x,\nabla \varPhi _t\right\rangle \rho _tdx. \end{aligned}$$
(41)

Hence, we can proceed to compute the terms with the coefficient \(\sqrt{\beta }\). Combining (34) and (40) yields

$$\begin{aligned} \begin{aligned}&-\sqrt{\beta } \int \left\langle \partial _tT_t,\nabla \varPhi _t\right\rangle \rho _tdx-\sqrt{\beta }\int \left\langle T_t(x)-x,\partial _t\nabla \varPhi _t+\nabla \frac{\delta E}{\delta \rho _t}\right\rangle \rho _tdx\\&\quad \quad -\frac{3\sqrt{\beta }}{2}\int \Vert \nabla \varPhi _t\Vert ^2\rho _tdx+\sqrt{\beta }\int \left\langle T_t(x)-x,\nabla \varPhi _t\right\rangle \nabla \cdot (\rho _t\nabla \varPhi _t) dx\\&\quad =-\sqrt{\beta }\int \left\langle \partial _tT_t+ \nabla T_t\nabla \varPhi _t, \nabla \varPhi _t\right\rangle \rho _t dx-\frac{\sqrt{\beta }}{2}\int \Vert \nabla \varPhi _t\Vert ^2\rho _tdx\\&\quad \quad -\sqrt{\beta }\int \left\langle T_t(x)-x,\partial _t\nabla \varPhi _t+\nabla \frac{\delta E}{\delta \rho }+\frac{1}{2}\nabla \Vert \nabla \varPhi _t\Vert ^2\right\rangle \rho _tdx\\&\quad \le -\frac{\sqrt{\beta }}{2}\int \Vert \nabla \varPhi _t\Vert ^2\rho _tdx+2\beta \int \left\langle T_t(x)-x,\nabla \varPhi _t\right\rangle \rho _tdx. \end{aligned} \end{aligned}$$
(42)

Substituting (41) and (42) into (39) gives

$$\begin{aligned} \begin{aligned}&{{\dot{{\mathcal {E}}}}}(t)e^{-\sqrt{\beta }t}+\frac{\sqrt{\beta }}{2}\int \Vert \nabla \varPhi _t\Vert ^2\rho _tdx\\&\quad \le \beta \int \left\langle \partial _t T_t, T_t(x)-x\right\rangle \rho _tdx-\frac{\beta }{2}\int \Vert T_t(x)-x\Vert ^2\nabla \cdot (\rho _t\nabla \varPhi _t) dx\\&\qquad -\beta \int \left\langle T_t(x)-x,\nabla \varPhi _t\right\rangle \rho _tdx+2\beta \int \left\langle T_t(x)-x,\nabla \varPhi _t\right\rangle \rho _tdx\\&\quad =\beta \int \left\langle \partial _t T_t+\nabla T_t\nabla \varPhi _t, T_t(x)-x\right\rangle \rho _tdx=0, \end{aligned} \end{aligned}$$

where the last equality uses (35). In summary, we have

$$\begin{aligned} {{\dot{{\mathcal {E}}}}}(t)e^{-\sqrt{\beta }t}\le -\frac{\sqrt{\beta }}{2}\int \Vert \nabla \varPhi _t\Vert ^2\rho _tdx\le 0. \end{aligned}$$

Proof of Proposition 5

Differentiating \({\mathcal {E}}(t)\) w.r.t. t, we compute that

$$\begin{aligned} \begin{aligned} {{\dot{{\mathcal {E}}}}}(t)&=\int \left\langle \partial _t T_t, T_t(x)-x\right\rangle \rho _t dx-\frac{1}{2}\int \Vert T_t(x)-x\Vert ^2\nabla \cdot (\rho _t\nabla \varPhi _t)dx\\&\quad -\int \left\langle \partial _t T_t, \frac{t}{2}\nabla \varPhi _t\right\rangle \rho _tdx-\int \left\langle T_t(x)-x,\frac{1}{2}\nabla \varPhi _t+\frac{t}{2}\partial _t\nabla \varPhi _t\right\rangle \rho _tdx\\&\quad +\int \left\langle T_t(x)-x, \frac{t}{2}\nabla \varPhi _t\right\rangle \nabla \cdot (\rho _t\nabla \varPhi _t)dx+\int \left\langle \frac{t}{2}\nabla \varPhi _t, \frac{1}{2}\nabla \varPhi _t+\frac{t}{2}\partial _t\nabla \varPhi _t\right\rangle \rho _tdx\\&\quad -\frac{1}{2}\int \left\| \frac{t}{2}\nabla \varPhi _t\right\| ^2\nabla \cdot (\rho _t\nabla \varPhi _t)dx-\frac{t^2}{4}\int \frac{\delta E}{\delta \rho _t}\nabla \cdot (\rho _t\nabla \varPhi _t)dx+\frac{t}{2}(E(\rho _t)-E(\rho ^*)).\\ \end{aligned} \end{aligned}$$
(43)

Because \(E(\rho )\) satisfies Hess(0) and \(E(\rho ^*)=0\), Proposition 8 yields

$$\begin{aligned} E(\rho _t)=E(\rho _t)-E(\rho ^*)\le -\int \left\langle T_t(x)-x, \nabla \frac{\delta E}{\delta \rho _t}\right\rangle \rho _t dx. \end{aligned}$$
(44)

Utilizing the inequality (44) and substituting the terms involving \(\partial _t T_t\) and \(\nabla \cdot (\rho _t\nabla \varPhi _t)\) in (43) with the expressions in (34), (35), (40) and (41), we obtain

$$\begin{aligned} \begin{aligned} {{\dot{{\mathcal {E}}}}}(t) \le&-\int \left\langle \nabla \varPhi _t, \nabla T_t(x)(T_t(x)-x)\right\rangle \rho _tdx+\int \left\langle T_t(x)-x,\nabla T_t\nabla \varPhi _t\right\rangle \rho _tdx\\&-\int \left\langle T_t(x)-x,\nabla \varPhi _t\right\rangle \rho _tdx+\frac{t}{2}\int \left\langle \nabla \varPhi _t, \nabla T_t\nabla \varPhi _t\right\rangle \rho _tdx\\&-\frac{1}{2}\int \left\langle T_t(x)-x, \nabla \varPhi _t\right\rangle \rho _tdx-\frac{t}{2}\int \left\langle \partial _t\nabla \varPhi _t, T_t(x)-x\right\rangle \rho _tdx\\&-\frac{t}{4}\int \left\langle T_t(x)-x,\nabla \Vert \nabla \varPhi _t\Vert ^2\right\rangle \rho _t dx-\frac{t}{2}\int \left\langle \nabla \varPhi _t, \nabla T_t\nabla \varPhi _t\right\rangle \rho _t dx\\&+\frac{t}{2}\int \Vert \nabla \varPhi _t\Vert ^2\rho _t dx+\frac{t}{4}\int \Vert \nabla \varPhi _t\Vert ^2\rho _tdx+\frac{t^2}{4}\int \left\langle \nabla \varPhi _t, \partial _t\nabla \varPhi _t\right\rangle \rho _tdx\\&+\frac{t^2}{8}\int \left\langle \nabla \varPhi _t,\nabla \Vert \nabla \varPhi _t\Vert ^2\right\rangle \rho _tdx+\frac{t^2}{4}\int \left\langle \nabla \varPhi _t, \nabla \frac{\delta E}{\delta \rho _t}\right\rangle \rho _tdx\\&-\frac{t}{2}\int \left\langle T_t(x)-x, \nabla \frac{\delta E}{\delta \rho _t}\right\rangle \rho _t dx. \end{aligned} \end{aligned}$$
(45)

The right-hand side of (45) can be reformulated as

$$\begin{aligned} \begin{aligned} {{\dot{{\mathcal {E}}}}}(t)\le&-\frac{3}{2}\int \left\langle T_t(x)-x,\nabla \varPhi _t\right\rangle \rho _tdx+\frac{3t}{4}\int \Vert \nabla \varPhi _t\Vert ^2\rho _tdx\\&-\frac{t}{2}\int \left\langle T_t(x)-x, \partial _t\nabla \varPhi _t+\frac{1}{2}\nabla \Vert \nabla \varPhi _t\Vert ^2+\nabla \frac{\delta E}{\delta \rho _t}\right\rangle \rho _tdx\\&+\frac{t^2}{4}\int \left\langle \nabla \varPhi _t, \partial _t\nabla \varPhi _t+\frac{1}{2}\nabla \Vert \nabla \varPhi _t\Vert ^2+\nabla \frac{\delta E}{\delta \rho _t}\right\rangle \rho _tdx. \end{aligned} \end{aligned}$$

From (W-AIG) with \(\alpha _t=3/t\), we have the following equalities.

$$\begin{aligned} \frac{t^2}{4}\int \left\langle \nabla \varPhi _t, \partial _t\nabla \varPhi _t+\frac{1}{2}\nabla \Vert \nabla \varPhi _t\Vert ^2+\nabla \frac{\delta E}{\delta \rho _t}\right\rangle \rho _tdx= & {} -\frac{3t}{4}\int \Vert \nabla \varPhi _t\Vert ^2\rho _tdx, \\ -\frac{t}{2}\int \left\langle T_t(x)-x, \partial _t\nabla \varPhi _t+\frac{1}{2}\nabla \Vert \nabla \varPhi _t\Vert ^2+\nabla \frac{\delta E}{\delta \rho _t}\right\rangle \rho _tdx= & {} \frac{3}{2}\int \left\langle T_t(x)-x,\nabla \varPhi _t\right\rangle \rho _tdx. \end{aligned}$$

As a result, \({{\dot{{\mathcal {E}}}}}(t)\le 0\). This completes the proof. \(\square \)

1.4 Comparison with the Proof in [35]

The accelerated flow in [35] is given by

$$\begin{aligned} \frac{d X_t}{dt} = e^{\alpha _t-\gamma _t}Y_t,\quad \frac{d Y_t}{dt} = -e^{\alpha _t+\beta _t+\gamma _t}\nabla \left( \frac{\delta E}{\delta \rho _t}\right) (X_t). \end{aligned}$$
(46)

Here the target distribution satisfies \(\rho _\infty (x)=\rho ^*(x)\propto \exp (-f(x))\). Suppose that we take \(\alpha _t = \log p-\log t\), \(\beta _t = p\log t +\log C\) and \(\gamma _t = p\log t\). Here we specify \(p=2\) and \(C=1/4\). Then the accelerated flow (46) recovers the particle formulation of W-AIG flows if we replace \(Y_t\) by \(\frac{t^3}{2}V_t\). The Lyapunov function in [35] follows

$$\begin{aligned} \begin{aligned} V(t)&= \frac{1}{2}{\mathbb {E}}\left[ \Vert X_t+e^{-\gamma _t}Y_t-T_{\rho _t}^{\rho ^*}(X_t)\Vert ^2\right] +e^{\beta _t}(E(\rho _t)-E(\rho ^*))\\&=\frac{1}{2}{\mathbb {E}}\left[ \Vert X_t+\frac{t}{2}V_t-T_{\rho _t}^{\rho ^*}(X_t)\Vert ^2\right] +\frac{t^2}{4}(E(\rho _t)-E(\rho ^*))\\&=\frac{1}{2}\int \left\| - (T_t(x)-x)+\frac{t}{2} \nabla \varPhi _t(x)\right\| ^2\rho _t(x) dx+\frac{t^2}{4}(E(\rho _t)-E(\rho ^*)). \end{aligned} \end{aligned}$$

The last equality is based on the fact that \(V_t = \nabla \varPhi _t(X_t)\) and \(T_t=T_{\rho _t}^{\rho ^*}\) is the optimal transport plan from \(\rho _t\) to \(\rho ^*\). This indicates that the Lyapunov function in [35] is identical to ours. The technical assumption in [35] follows

$$\begin{aligned} \begin{aligned} 0&={\mathbb {E}}\left[ \left( X_t+e^{-\gamma _t}Y_t-T_{\rho _t}^{\rho ^*}(X_t)\right) \cdot \frac{d}{dt}T_{\rho _t}^{\rho ^*}(X_t)\right] \\&={\mathbb {E}}\left[ \left( X_t+\frac{t}{2}V_t-T_t(X_t)\right) \cdot \frac{d}{dt}T_t(X_t)\right] \\&={\mathbb {E}}\left[ \left( X_t+\frac{t}{2}V_t-T_t(X_t)\right) \cdot \left( (\partial _tT_t)(X_t)+\nabla T_tV_t\right) \right] \\&=\int \left\langle x-T_t(x)+\frac{t}{2}\nabla \varPhi _t(x),\partial _tT_t+\nabla T_t \nabla \varPhi _t\right\rangle \rho _tdx. \end{aligned} \end{aligned}$$

Based on \(\partial _tT_t=-\nabla T_tu_t\) and Lemma 3, we have

$$\begin{aligned} \int \left\langle x-T_t(x),\partial _tT_t+\nabla T_t \nabla \varPhi _t\right\rangle \rho _tdx&= \int \left\langle x-T_t(x),\nabla T_t (\nabla \varPhi _t-u_t)\right\rangle \rho _tdx=0. \\ \int \left\langle \nabla \varPhi _t,\partial _tT_t+\nabla T_t \nabla \varPhi _t\right\rangle \rho _tdx&= \int \left\langle \nabla \varPhi _t,\nabla T_t(\nabla \varPhi _t-u_t)\right\rangle \rho _tdx\\&=\int \left\langle \nabla \varPhi _t-u_t,\nabla T_t(\nabla \varPhi _t-u_t)\right\rangle \rho _tdx\ge 0. \end{aligned}$$

As a result, we have

$$\begin{aligned} \begin{aligned}&{\mathbb {E}}\left[ \left( X_t+e^{-\gamma _t}Y_t-T_{\rho _t}^{\rho _\infty }(X_t)\right) \cdot \frac{d}{dt}T_{\rho _t}^{\rho _\infty }(X_t)\right] = \frac{t}{2}\int \left\langle \nabla \varPhi _t-u_t,\nabla T_t(\nabla \varPhi _t-u_t)\right\rangle \rho _tdx\ge 0. \end{aligned} \end{aligned}$$

In the 1-dimensional case, \(\nabla \cdot \left( \rho _t(u_t-\nabla \varPhi _t)\right) =0\) implies that \(\rho _t(u_t-\nabla \varPhi _t)\) is constant; since it vanishes at infinity, \(\rho _t(u_t-\nabla \varPhi _t)=0\). For \(\rho _t(x)>0\), we have \(u_t(x)-\nabla \varPhi _t(x) = 0\), so the technical assumption holds. In general, although \(u_t = \partial _t(T_t)^{-1}\circ T_t\) satisfies \(\nabla \cdot \left( \rho _t(u_t-\nabla \varPhi _t)\right) =0\), this does not necessarily indicate that \(u_t=\nabla \varPhi _t\). Hence, \({\mathbb {E}}\left[ \left( X_t+e^{-\gamma _t}Y_t-T_{\rho _t}^{\rho _\infty }(X_t)\right) \cdot \frac{d}{dt}T_{\rho _t}^{\rho _\infty }(X_t)\right] =0\) does not necessarily hold except in the 1-dimensional case.

Proof of Convergence Rate Under Fisher-Rao Metric

In this section, we present proofs of the propositions in Sect. 4 under the Fisher-Rao metric.

1.1 Geodesic Curve Under the Fisher-Rao Metric

We first derive the explicit solution of the geodesic curve under the Fisher-Rao metric in probability space. The geodesic curve satisfies

$$\begin{aligned} \left\{ \begin{aligned}&\partial _t\rho _t-(\varPhi _t-{\mathbb {E}}_{\rho _t}[\varPhi _t])\rho _t=0,\\&\partial _t\varPhi _t+\frac{1}{2}\varPhi _t^2-{\mathbb {E}}_{\rho _t}[\varPhi _t]\varPhi _t=0. \end{aligned}\right. \end{aligned}$$
(47)

with initial values \(\rho _t|_{t=0}=\rho _0\) and \(\varPhi _t|_{t=0}=\varPhi _0\). The Hamiltonian is

$$\begin{aligned} {\mathcal {H}}(\rho ,\varPhi ) = \frac{1}{2}\left( {\mathbb {E}}_{\rho }[\varPhi ^2]-\left( {\mathbb {E}}_{\rho }[\varPhi ]\right) ^2\right) . \end{aligned}$$

We reparametrize \(\rho _t\) by \(\rho _t= R_t^2\) with \(R_t>0\) and \(\int R_t^2 dx=1\). Then,

$$\begin{aligned} \left\{ \begin{aligned}&\partial _tR_t-\frac{1}{2}(\varPhi _t-{\mathbb {E}}_{R_t^2}[\varPhi _t])R_t=0,\\&\partial _t\varPhi _t+\frac{1}{2}\varPhi _t^2-{\mathbb {E}}_{R_t^2}[\varPhi _t]\varPhi _t=0. \end{aligned}\right. \end{aligned}$$

Proposition 9

The solution to (47) with initial values \(\rho _t|_{t=0}=\rho _0\) and \(\varPhi _t|_{t=0}=\varPhi _0\) follows

$$\begin{aligned} R_t(x) = A(x)\sin (Ht)+B(x)\cos (Ht), \end{aligned}$$
(48)

where

$$\begin{aligned} A(x) = \frac{1}{2H} R_0(x)\left( \varPhi _0(x)-{\mathbb {E}}_{R_0^2}[\varPhi _0]\right) ,\quad B(x)=R_0(x), \end{aligned}$$
(49)

and

$$\begin{aligned} H=\frac{1}{2}\sqrt{{\mathbb {E}}_{R_0^2}[\varPhi _0^2]-\left( {\mathbb {E}}_{R_0^2}[\varPhi _0]\right) ^2}. \end{aligned}$$

We also have \(\int R_t^2 dx=1\) for \(t\ge 0\).

Proof

We can compute that

$$\begin{aligned} \begin{aligned} 2\partial _{tt}R_t=&\left( \partial _t\varPhi _t-2\int R_t\varPhi _t\partial _t R_t dx-{\mathbb {E}}_{R_t^2}[\partial _t\varPhi _t]\right) R_t+\partial _t R_t (\varPhi _t-{\mathbb {E}}_{R_t^2}[\varPhi _t])\\ =&\left( -\frac{1}{2}\varPhi _t^2+\frac{1}{2}{\mathbb {E}}_{R_t^2}[\varPhi _t^2]+{\mathbb {E}}_{R_t^2}[\varPhi _t]\varPhi _t-{\mathbb {E}}_{R_t^2}[\varPhi _t]^2\right) R_t\\&-{\mathbb {E}}_{R_t^2}[\varPhi _t(\varPhi _t-{\mathbb {E}}_{R_t^2}[\varPhi _t])] R_t+\frac{1}{2} R_t(\varPhi _t-{\mathbb {E}}_{R_t^2}[\varPhi _t])^2\\ =&\left( -\frac{1}{2}{\mathbb {E}}_{R_t^2}[\varPhi _t^2]+\frac{1}{2}\left( {\mathbb {E}}_{R_t^2}[\varPhi _t]\right) ^2\right) R_t. \end{aligned} \end{aligned}$$

In other words,

$$\begin{aligned} \partial _{tt} R_t = \left( -\frac{1}{4}{\mathbb {E}}_{R_t^2}[\varPhi _t^2]+\frac{1}{4}{\mathbb {E}}_{R_t^2}[\varPhi _t]^2\right) R_t. \end{aligned}$$

We observe that \(\frac{1}{2}{\mathbb {E}}_{R_t^2}[\varPhi _t^2]-\frac{1}{2}{\mathbb {E}}_{R_t^2}[\varPhi _t]^2={\mathcal {H}}(\rho _t, \varPhi _t)\) is the Hamiltonian, which is invariant along the geodesic curve. Denote

$$\begin{aligned} H=\sqrt{\frac{1}{2}{\mathcal {H}}(\rho _t,\varPhi _t)} = \frac{1}{2}\sqrt{{\mathbb {E}}_{R_0^2}[\varPhi _0^2]-\left( {\mathbb {E}}_{R_0^2}[\varPhi _0]\right) ^2}. \end{aligned}$$

Then, we have

$$\begin{aligned} \partial _{tt}R_t=-H^2R_t, \end{aligned}$$

which is a harmonic oscillator equation in \(t\) for each fixed x. We also notice that

$$\begin{aligned} R_t(x)|_{t=0}=R_0(x),\quad \partial _tR_t(x)|_{t=0}=\frac{1}{2}R_0(x)(\varPhi _0(x)-{\mathbb {E}}_{R_0^2}[\varPhi _0]). \end{aligned}$$

Hence, \(R_t\) is uniquely determined by

$$\begin{aligned} R_t(x) = A(x)\sin (Ht)+B(x)\cos (Ht), \end{aligned}$$

where A(x) and B(x) are given in (49). Finally, we verify that \(\int R_t^2 dx=1\). Actually, we can compute that

$$\begin{aligned}&\int A^2(x)dx=\frac{1}{4H^2} {\mathbb {E}}_{R_0^2} [(\varPhi _0(x)-{\mathbb {E}}_{R_0^2}[\varPhi _0])^2]=1, \\&\int B^2(x)dx = \int R_0^2(x)dx=1, \\&\int A(x)B(x) dx = \frac{1}{2H} {\mathbb {E}}_{R_0^2}[\varPhi _0(x)-{\mathbb {E}}_{R_0^2}[\varPhi _0]]=0. \end{aligned}$$

Hence,

$$\begin{aligned} \begin{aligned} \int R_t(x)^2dx&= \sin ^2(Ht)\int A^2(x)dx+\cos ^2(Ht)\int B^2(x)dx \\&\quad +2\sin (Ht)\cos (Ht)\int A(x)B(x) dx =1. \end{aligned} \end{aligned}$$

\(\square \)
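Proposition 9 can also be checked numerically on a grid. The sketch below is an illustration, not code from the paper; it assumes a simple 1-D grid quadrature and an arbitrary illustrative choice of \(\rho _0\) and \(\varPhi _0\), and verifies that \(\int R_t^2 dx\) stays equal to 1 along the curve (48).

```python
import numpy as np

# Grid discretization of a 1-D density rho_0 and an illustrative potential Phi_0.
x = np.linspace(-5.0, 5.0, 2001)
dx = x[1] - x[0]
rho0 = np.exp(-0.5 * x**2)
rho0 /= rho0.sum() * dx                   # enforce ∫ rho_0 dx = 1 on the grid
R0 = np.sqrt(rho0)
Phi0 = np.sin(x)                          # any Phi_0 with positive variance works

def expect(g):
    """Expectation under rho_0 = R_0^2, via grid quadrature."""
    return (g * rho0).sum() * dx

H = 0.5 * np.sqrt(expect(Phi0**2) - expect(Phi0)**2)   # the constant H in (48)
A = R0 * (Phi0 - expect(Phi0)) / (2.0 * H)             # coefficients (49)
B = R0

def R(t):
    """Geodesic (48): R_t = A sin(Ht) + B cos(Ht)."""
    return A * np.sin(H * t) + B * np.cos(H * t)

# ∫ R_t^2 dx remains 1 along the geodesic, as Proposition 9 asserts
for t in (0.0, 0.7, 2.3):
    assert abs((R(t)**2).sum() * dx - 1.0) < 1e-8
```

The three quadrature identities from the proof (\(\int A^2dx=1\), \(\int B^2dx=1\), \(\int ABdx=0\)) hold exactly under this discretization, so the mass is preserved up to floating-point error.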

Proposition 10

Suppose that \(\rho _0,\rho _1>0\), \(\rho _0\ne \rho _1\). Then, there exists a geodesic curve \(\rho (t)\) with \(\rho _t|_{t=0}=\rho _0\) and \(\rho _t|_{t=1}=\rho _1\).

Proof

We denote \(R_0(x)=\sqrt{\rho _0(x)}\) and \(R_1(x)=\sqrt{\rho _1(x)}\). We only need to solve for A(x) and \(H>0\) such that

$$\begin{aligned} R_1(x) = A(x)\sin (H)+R_0(x)\cos (H). \end{aligned}$$

We shall have

$$\begin{aligned} \int R_1(x)R_0(x)dx = \cos (H), \end{aligned}$$

which indicates \(H=\cos ^{-1} \left( \int R_1(x)R_0(x)dx \right) \in (0,\pi /2)\), since \(0<\int R_1R_0dx<1\) by the Cauchy–Schwarz inequality and \(\rho _0\ne \rho _1\). Hence, we have

$$\begin{aligned} A(x) = \frac{R_1(x)-R_0(x)\cos (H)}{\sin (H)}. \end{aligned}$$

We can examine that

$$\begin{aligned} \int A^2(x)dx = \frac{1-2\cos ^2(H)+\cos ^2(H)}{\sin ^2(H)}=1. \end{aligned}$$

On the other hand, we shall examine that

$$\begin{aligned} R_t(x)>0,\quad t\in [0,1]. \end{aligned}$$

Indeed,

$$\begin{aligned} \begin{aligned} R_t(x)&= A(x)\sin (Ht)+R_0(x)\cos (Ht)\\&=\frac{\sin (Ht)(R_1(x)-R_0(x)\cos (H))+R_0(x)\cos (Ht)\sin (H)}{\sin (H)}\\&=\frac{1}{\sin H} (\sin (Ht)R_1(x)+(\cos (Ht)\sin (H)-\sin (Ht)\cos (H))R_0(x))\\&=\frac{1}{\sin H} (\sin (Ht)R_1(x)+\sin (H(1-t)) R_0(x))>0. \end{aligned} \end{aligned}$$

Hence, \(\rho _t(x)=R_t^2(x)\) is the geodesic curve. \(\square \)

A direct consequence is a formula for the Fisher-Rao distance between \(\rho _0\) and \(\rho _1\). Namely, we can recover \(\varPhi _0\) by

$$\begin{aligned} \varPhi _0(x) = \frac{2HA(x)}{R_0(x)}. \end{aligned}$$

We note that \(2{\mathcal {H}}(\rho _t,\varPhi _t)=4H^2\), which equals the squared metric speed along the geodesic. Hence, we have

$$\begin{aligned} \left( {\mathcal {D}}^{FR}(\rho _0,\rho _1)\right) ^2 = \int _0^1 4H^2 dt = 4H^2. \end{aligned}$$
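The resulting distance \({\mathcal {D}}^{FR}(\rho _0,\rho _1)=2H=2\cos ^{-1}\left( \int \sqrt{\rho _0\rho _1}dx\right) \) is straightforward to evaluate numerically. Below is a minimal sketch (our own, for illustration) on a 1-D grid with two illustrative Gaussian densities.

```python
import numpy as np

# Fisher-Rao distance D_FR(rho0, rho1) = 2 * arccos( ∫ sqrt(rho0 rho1) dx )
x = np.linspace(-8.0, 8.0, 4001)
dx = x[1] - x[0]

def gaussian(mu, s):
    p = np.exp(-0.5 * ((x - mu) / s) ** 2)
    return p / (p.sum() * dx)            # normalize on the grid

def fisher_rao_distance(rho0, rho1):
    bc = (np.sqrt(rho0 * rho1)).sum() * dx        # Bhattacharyya coefficient
    return 2.0 * np.arccos(np.clip(bc, -1.0, 1.0))

d = fisher_rao_distance(gaussian(0.0, 1.0), gaussian(1.0, 1.0))
# d vanishes iff the densities coincide, and is bounded by pi
```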

Remark 9

We note that the manifold \(({\mathcal {P}}^+(\varOmega ), {\mathcal {G}}^{FR}(\rho ))\) is homeomorphic to the manifold \((S^+(\varOmega ), {\mathcal {G}}^{E}(R))\), where \(S^+(\varOmega )=\{R\in {\mathcal {F}}(\varOmega ):R>0,\int R^2 dx =1\}\). Here \((S^+(\varOmega ), {\mathcal {G}}^{E}(R))\) is the submanifold of \({\mathbb {L}}^2(\varOmega )\) equipped with the standard Euclidean metric.

1.2 Convergence Analysis

We consider the accelerated Fisher-Rao gradient flow

$$\begin{aligned} \left\{ \begin{aligned}&\partial _t\rho _t-(\varPhi _t-{\mathbb {E}}_{\rho _t}[\varPhi _t])\rho _t=0,\\&\partial _t\varPhi _t+\alpha _t\varPhi _t+\frac{1}{2}\varPhi _t^2-{\mathbb {E}}_{\rho _t}[\varPhi _t]\varPhi _t+\frac{\delta E}{\delta \rho _t}=0. \end{aligned}\right. \end{aligned}$$
(50)

In terms of \(R_t\), we have

$$\begin{aligned} \left\{ \begin{aligned}&\partial _tR_t-\frac{1}{2}(\varPhi _t-{\mathbb {E}}_{R_t^2}[\varPhi _t])R_t=0,\\&\partial _t\varPhi _t+\alpha _t\varPhi _t+\frac{1}{2}\varPhi _t^2-{\mathbb {E}}_{R^2_t}[\varPhi _t]\varPhi _t+\frac{\delta E}{\delta \rho _t}=0. \end{aligned}\right. \end{aligned}$$
(51)

Then, we prove the convergence results for \(\beta \)-strongly convex \(E(\rho )\). Here we take \(\alpha _t=2\sqrt{\beta }\). Consider the Lyapunov function

$$\begin{aligned} \begin{aligned} {\mathcal {E}}(t) =&\frac{e^{\sqrt{\beta }t}}{2}\int |\varPhi _t-{\mathbb {E}}_{R_t^2}[\varPhi _t]-\sqrt{\beta } T_t|^2\rho _tdx\\&+e^{\sqrt{\beta } t} (E(\rho _t)-E(\rho ^*)). \end{aligned} \end{aligned}$$

Here we define

$$\begin{aligned} T_t(x) = \frac{2 H_t}{\sin (H_t)}\frac{R^*(x)-R_t(x)\cos (H_t)}{R_t(x)}, \quad H_t=\cos ^{-1} \left( \int R_t(x)R^*(x)dx \right) . \end{aligned}$$

We can rewrite the Lyapunov function as

$$\begin{aligned} \begin{aligned} {\mathcal {E}}(t) =&\frac{e^{\sqrt{\beta }t}}{2}\int (\varPhi _t-{\mathbb {E}}_{R_t^2}[\varPhi _t])^2\rho _tdx-\sqrt{\beta }e^{\sqrt{\beta }t}\int (\varPhi _t-{\mathbb {E}}_{R_t^2}[\varPhi _t])T_t\rho _tdx\\&+\frac{\beta e^{\sqrt{\beta }t}}{2}\int T_t^2\rho _tdx+e^{\sqrt{\beta } t} (E(\rho _t)-E(\rho ^*)). \end{aligned} \end{aligned}$$

Remark 10

Here the definition of \(T_t\) may be problematic if \(R_t(x)=0\) for some x. Nevertheless, the integral

$$\begin{aligned} \int T_t^2\rho _t dx = \int (R_tT_t)^2 dx \end{aligned}$$

is well-defined, since \(R_tT_t=\frac{2H_t}{\sin (H_t)}(R^*-R_t\cos (H_t))\) involves no division by \(R_t\).

From the definition of convexity in probability space, we derive the following proposition.

Proposition 11

The \(\beta \)-convexity of \(E(\rho )\) implies that

$$\begin{aligned} E(\rho ^*)\ge E(\rho _t)+\int \left( \frac{\delta E}{\delta \rho _t}-{\mathbb {E}}_{\rho _t}\left[ \frac{\delta E}{\delta \rho _t}\right] \right) T_t\rho _tdx+\frac{\beta }{2}\int T_t^2\rho _tdx. \end{aligned}$$

For simplicity, we define

$$\begin{aligned} {\mathcal {F}}_t[\varPsi ] = \varPsi -{\mathbb {E}}_{R_t^2}[\varPsi ]. \end{aligned}$$

We have

$$\begin{aligned} \begin{aligned} \partial _t({\mathcal {F}}_t[\varPsi ]) =&\partial _t\varPsi - {\mathbb {E}}_{R_t^2}[\partial _t\varPsi ]-\int R_t^2 {\mathcal {F}}_t[\varPhi _t] \varPsi dx={\mathcal {F}}_t[\partial _t\varPsi ] -\int R_t^2 {\mathcal {F}}_t[\varPhi _t] \varPsi dx. \end{aligned} \end{aligned}$$

Before we perform computations, we establish several identities.

$$\begin{aligned}&\int {\mathcal {F}}_t[\varPsi ] R_t^2dx = 0. \\&\int {\mathcal {F}}_t[\varPsi _1]{\mathcal {F}}_t[\varPsi _2] R_t^2dx=\int {\mathcal {F}}_t[\varPsi _1]\varPsi _2 R_t^2dx=\int {\mathcal {F}}_t[\varPsi _2]\varPsi _1 R_t^2dx. \end{aligned}$$

Lemma 4

We have the following observations:

$$\begin{aligned}&\int (\partial _t T_t){\mathcal {F}}_t[\varPhi _t] R_t^2 dx+\frac{1}{2} \int T_t ({\mathcal {F}}_t[\varPhi _t])^2R_t^2dx \ge - \int ({\mathcal {F}}_t[\varPhi _t] )^2 R_t^2dx, \end{aligned}$$
(52)
$$\begin{aligned}&\int (\partial _t T_t )T_t R_t^2 dx=-\int T_t \varPhi _t R_t^2 dx-\frac{1}{2}\int T_t^2 {\mathcal {F}}_t[\varPhi _t] R_t^2 dx. \end{aligned}$$
(53)

Proof

We note that

$$\begin{aligned}\int T_t^2R_t^2 dx = 4H_t^2,\end{aligned}$$

and

$$\begin{aligned} \int ({\mathcal {F}}_t[R^*R_t^{-1}])^2 R_t^2 dx=\frac{\sin ^2(H_t)}{4H_t^2}\int T_t^2R_t^2 dx=\sin ^2(H_t). \end{aligned}$$

We compute the derivatives as follows:

$$\begin{aligned} \partial _t H_t=&-\frac{1}{\sin H_t}\partial _t{\int R_tR^*dx}=-\frac{1}{2\sin H_t}\int R_tR^*{\mathcal {F}}_t[\varPhi _t] dx. \\ \partial _t T_t =&-\frac{1}{\sin H_t}\left( \int R_tR^*{\mathcal {F}}_t[\varPhi _t] dx\right) \frac{\sin (H_t)-H_t\cos (H_t) }{\sin ^2(H_t)}(R^*R_t^{-1}-\cos (H_t))\\&+\frac{2H_t}{\sin (H_t)}\left( -\frac{1}{2}R^*R_t^{-1}{\mathcal {F}}_t[\varPhi _t]-\frac{1}{2}\int R_tR^*{\mathcal {F}}_t[\varPhi _t] dx\right) \\ =&-\frac{1}{\sin H_t}\left( \int R^* R_t{\mathcal {F}}_t[\varPhi _t] dx\right) \frac{\sin (H_t)-H_t\cos (H_t) }{\sin ^2(H_t)}{\mathcal {F}}_t[R^* R_t^{-1}]\\&-\frac{H_t}{\sin (H_t)}\left( R^*R_t^{-1}{\mathcal {F}}_t[\varPhi _t]+\int R_tR^*{\mathcal {F}}_t[\varPhi _t] dx\right) . \end{aligned}$$

For the first inequality, we have

$$\begin{aligned} \begin{aligned} \int ( \partial _t T_t){\mathcal {F}}_t[\varPhi _t] R_t^2 dx&=-\frac{1}{\sin (H_t)}\left( \int R^* R_t{\mathcal {F}}_t[\varPhi _t] dx\right) \frac{\sin (H_t)-H_t\cos (H_t) }{\sin ^2(H_t)}\int {\mathcal {F}}_t[R^* R_t^{-1}] {\mathcal {F}}_t[\varPhi _t] R_t^2 dx\\&\quad -\frac{H_t}{\sin (H_t)}\int (R^*R_t^{-1}{\mathcal {F}}_t[\varPhi _t]) {\mathcal {F}}_t[\varPhi _t] R_t^2 dx\\&=-\frac{\sin (H_t)-H_t\cos (H_t)}{\sin ^3(H_t)}\left( \int {\mathcal {F}}_t[R^* R_t^{-1}]{\mathcal {F}}_t[\varPhi _t] R_t^2dx\right) ^2\\&\quad -\frac{1}{2}\frac{2H_t}{\sin (H_t)}\int R^*R_t^{-1}{\mathcal {F}}_t[\varPhi _t] {\mathcal {F}}_t[\varPhi _t] R_t^2 dx\\&\ge -\frac{\sin (H_t)-H_t\cos (H_t)}{\sin ^3(H_t)}\left( \int ({\mathcal {F}}_t[\varPhi _t])^2 R_t^2 dx\right) \left( \int ({\mathcal {F}}_t[R^*R_t^{-1}])^2 R_t^2 dx\right) \\&\quad -\frac{1}{2}\frac{2H_t}{\sin (H_t)}\int (R^*R_t^{-1}-\cos (H_t))({\mathcal {F}}_t[\varPhi _t])^2R_t^2dx\\&\quad -\frac{1}{2}\frac{2H_t}{\sin (H_t)}\int \cos (H_t)({\mathcal {F}}_t[\varPhi _t])^2R_t^2dx\\&= -\frac{\sin (H_t)-H_t\cos (H_t)}{\sin (H_t)}\left( \int ({\mathcal {F}}_t[\varPhi _t])^2 R_t^2 dx\right) -\frac{1}{2} \int T_t ({\mathcal {F}}_t[\varPhi _t])^2R_t^2dx\\&\quad -\frac{H_t\cos (H_t)}{\sin (H_t)}\int R_t^2({\mathcal {F}}_t[\varPhi _t])^2 dx\\&=-\frac{1}{2} \int T_t ({\mathcal {F}}_t[\varPhi _t])^2R_t^2dx-\int ({\mathcal {F}}_t[\varPhi _t])^2 R_t^2 dx. \end{aligned} \end{aligned}$$

The inequality follows from the Cauchy–Schwarz inequality. For the second identity, we have

$$\begin{aligned} \begin{aligned} \int ( \partial _t T_t) T_t R_t^2 dx&=-\frac{1}{\sin H_t}\left( \int R^* R_t{\mathcal {F}}_t[\varPhi _t] dx\right) \frac{\sin (H_t)-H_t\cos (H_t) }{\sin ^2(H_t)}\int T_t{\mathcal {F}}_t[R^* R_t^{-1}] R_t^2 dx\\&\quad -\frac{H_t}{\sin (H_t)}\int T_t R^*R_t^{-1} {\mathcal {F}}_t[\varPhi _t] R_t^2 dx\\&=-\frac{1}{\sin H_t}\left( \int R^* R_t{\mathcal {F}}_t[\varPhi _t] dx\right) \frac{\sin (H_t)-H_t\cos (H_t) }{2\sin (H_t) H_t}\int T_t^2R_t^2 dx\\&\quad -\frac{1}{2}\frac{2H_t}{\sin (H_t)}\int (R^* R_t^{-1} -\cos (H_t)) T_t {\mathcal {F}}_t[\varPhi _t] R_t^2 dx-\frac{1}{2}\frac{2H_t\cos (H_t)}{\sin (H_t)}\int T_t {\mathcal {F}}_t[\varPhi _t] R_t^2 dx\\&=-\frac{1}{2 H_t}\left( \int T_t \varPhi _t R_t^2 dx\right) \frac{\sin (H_t)-H_t\cos (H_t) }{2\sin (H_t) H_t}\int T_t^2R_t^2 dx\\&\quad -\frac{1}{2}\int T_t^2 {\mathcal {F}}_t[\varPhi _t] R_t^2 dx-\frac{H_t\cos (H_t)}{\sin (H_t)}\int T_t \varPhi _t R_t^2 dx\\&=-\left( \frac{\sin (H_t)-H_t\cos (H_t) }{\sin (H_t)}+\frac{H_t\cos (H_t)}{\sin (H_t)}\right) \int T_t \varPhi _t R_t^2 dx-\frac{1}{2}\int T_t^2 {\mathcal {F}}_t[\varPhi _t] R_t^2 dx\\&=-\int T_t \varPhi _t R_t^2 dx-\frac{1}{2}\int T_t^2 {\mathcal {F}}_t[\varPhi _t] R_t^2 dx. \end{aligned} \end{aligned}$$

This completes the proof. \(\square \)

Hence, we can compute that

$$\begin{aligned} \begin{aligned} e^{-\sqrt{\beta }t}\partial _t{\mathcal {E}}(t) =&\frac{\sqrt{\beta }}{2}\int ({\mathcal {F}}_t[\varPhi _t])^2R_t^2dx+\int {\mathcal {F}}_t[\varPhi _t]\left( {\mathcal {F}}_t[\partial _t\varPhi _t]-\int R_t^2 {\mathcal {F}}_t[\varPhi _t] \varPhi _t dx\right) R_t^2 dx\\&+\frac{1}{2}\int ({\mathcal {F}}_t[\varPhi _t])^2 {\mathcal {F}}_t[\varPhi _t] R_t^2 dx-\beta \int (\varPhi _t-{\mathbb {E}}_{R_t^2}[\varPhi _t])T_t\rho _tdx\\&-\sqrt{\beta }\int \left( {\mathcal {F}}_t[\partial _t\varPhi _t]-\int R_t^2 {\mathcal {F}}_t[\varPhi _t] \varPhi _t dx\right) T_tR_t^2 dx \\&-\sqrt{\beta }\int \partial _tT_t {\mathcal {F}}_t[\varPhi _t] R_t^2 dx-\sqrt{\beta }\int ({\mathcal {F}}[\varPhi _t])^2T_t R_t^2 dx \\&+\frac{\beta \sqrt{\beta }}{2}\int T_t^2R_t^2dx+\beta \int \partial _t T_t T_t R_t^2dx+\frac{\beta }{2} \int T_t^2 {\mathcal {F}}_t[\varPhi _t]R_t^2 dx\\&+ \sqrt{\beta } (E(\rho _t)-E(\rho ^*))+\int {\mathcal {F}}_t[\varPhi _t]{\mathcal {F}}_t\left[ \frac{\delta E}{\delta \rho _t}\right] R_t^2 dx. \end{aligned} \end{aligned}$$

From Proposition 11, we have

$$\begin{aligned} \begin{aligned}&\sqrt{\beta } (E(\rho _t)-E(\rho ^*))+\frac{\beta \sqrt{\beta }}{2}\int T_t^2R_t^2dx\le -\sqrt{\beta } \int {\mathcal {F}}_t\left[ \frac{\delta E}{\delta \rho _t}\right] T_t\rho _tdx. \end{aligned} \end{aligned}$$

We first compute terms with coefficient \(\beta ^0\). We have

$$\begin{aligned} \begin{aligned}&\int {\mathcal {F}}_t[\varPhi _t]\left( {\mathcal {F}}_t[\partial _t\varPhi _t]-\int R_t^2 {\mathcal {F}}_t[\varPhi _t] \varPhi _t dx\right) R_t^2 dx\\&\qquad +\frac{1}{2}\int ({\mathcal {F}}_t[\varPhi _t])^2 {\mathcal {F}}_t[\varPhi _t] R_t^2 dx+\int {\mathcal {F}}_t[\varPhi _t]{\mathcal {F}}_t\left[ \frac{\delta E}{\delta \rho _t}\right] R_t^2 dx\\&\quad =\int {\mathcal {F}}_t[\varPhi _t] \partial _t\varPhi _t R_t^2dx+\frac{1}{2}\int ({\mathcal {F}}_t[\varPhi _t])^2 {\mathcal {F}}_t[\varPhi _t] R_t^2 dx+\int {\mathcal {F}}_t[\varPhi _t]\frac{\delta E}{\delta \rho _t} R_t^2 dx\\&\quad =\int {\mathcal {F}}_t[\varPhi _t]\left( -2\sqrt{\beta }\varPhi _t-\frac{1}{2}\varPhi _t^2+{\mathbb {E}}_{R_t^2}[\varPhi _t]\varPhi _t+\frac{1}{2}{\mathcal {F}}_t[\varPhi _t]^2\right) R_t^2 dx\\&\quad =\int {\mathcal {F}}_t[\varPhi _t]\left( -2\sqrt{\beta }\varPhi _t+\frac{1}{2}({\mathbb {E}}_{R_t^2}[\varPhi _t])^2\right) R_t^2 dx\\&\quad =-2\sqrt{\beta } \int {\mathcal {F}}_t[\varPhi _t] \varPhi _t R_t^2 dx. \end{aligned} \end{aligned}$$

We then proceed to compute terms with coefficient \(\beta ^{1/2}\).

$$\begin{aligned} \begin{aligned}&\frac{\sqrt{\beta }}{2}\int ({\mathcal {F}}_t[\varPhi _t])^2R_t^2dx-\sqrt{\beta }\int \left( {\mathcal {F}}_t[\partial _t\varPhi _t]-\int R_t^2 {\mathcal {F}}_t[\varPhi _t] \varPhi _t dx\right) T_tR_t^2 dx\\&\qquad -2\sqrt{\beta } \int {\mathcal {F}}_t[\varPhi _t] \varPhi _t R_t^2 dx-\sqrt{\beta }\int \partial _tT_t {\mathcal {F}}_t[\varPhi _t] R_t^2 dx-\sqrt{\beta }\int ({\mathcal {F}}_t[\varPhi _t])^2T_t R_t^2 dx\\&\qquad -\sqrt{\beta } \int {\mathcal {F}}_t\left[ \frac{\delta E}{\delta \rho _t}\right] T_t\rho _tdx\\&\quad =-\frac{3\sqrt{\beta }}{2}\int ({\mathcal {F}}_t[\varPhi _t])^2R_t^2dx-\sqrt{\beta }\int \partial _t\varPhi _t T_tR_t^2 dx-\sqrt{\beta }\int \partial _tT_t {\mathcal {F}}_t[\varPhi _t] R_t^2 dx\\&\qquad -\sqrt{\beta }\int ({\mathcal {F}}_t[\varPhi _t])^2T_t R_t^2 dx-\sqrt{\beta } \int \frac{\delta E}{\delta \rho _t}T_tR_t^2dx\\&\quad =-\sqrt{\beta } \int T_tR_t^2\left( \partial _t\varPhi _t+\frac{\delta E}{\delta \rho _t}+\frac{1}{2}({\mathcal {F}}_t[\varPhi _t])^2\right) dx-\frac{3\sqrt{\beta }}{2}\int ({\mathcal {F}}_t[\varPhi _t])^2R_t^2dx\\&\qquad -\sqrt{\beta }\int \partial _tT_t {\mathcal {F}}_t[\varPhi _t] R_t^2 dx-\frac{\sqrt{\beta }}{2}\int ({\mathcal {F}}_t[\varPhi _t])^2 T_t R_t^2 dx\\&\quad \le 2\beta \int T_t\varPhi _t R_t^2dx-\frac{\sqrt{\beta }}{2}\int ({\mathcal {F}}_t[\varPhi _t])^2R_t^2dx. \end{aligned} \end{aligned}$$

The last inequality is based on Lemma 4. Finally, we compute terms with coefficient \(\beta \):

$$\begin{aligned} \begin{aligned}&2\beta \int T_t\varPhi _t R_t^2dx-\beta \int \varPhi _t T_tR_t^2 dx+\beta \int \partial _t T_t T_t R_t^2 dx+ \frac{\beta }{2} \int T_t^2 {\mathcal {F}}_t[\varPhi _t]R_t^2 dx=0. \end{aligned} \end{aligned}$$

In summary, we have

$$\begin{aligned} e^{-\sqrt{\beta }t}\partial _t{\mathcal {E}}(t) \le -\frac{\sqrt{\beta }}{2}\int ({\mathcal {F}}_t[\varPhi _t])^2R_t^2dx\le 0. \end{aligned}$$

Hence \({\mathcal {E}}(t)\le {\mathcal {E}}(0)\), which yields \(E(\rho _t)-E(\rho ^*)\le e^{-\sqrt{\beta } t}{\mathcal {E}}(0)\).

For convex \(E(\rho )\), we let \(\alpha _t=3/t\). Consider

$$\begin{aligned} {\mathcal {E}}(t)=\frac{1}{2} \int \left( -T_t+\frac{t}{2}\varPhi _t\right) ^2R_t^2dx+\frac{t^2}{4}(E(R_t^2)-E(\rho ^*)). \end{aligned}$$

We can compute that

$$\begin{aligned} \begin{aligned} {{\dot{{\mathcal {E}}}}}(t) =&\int (\partial _t T_t) T_t R^2_t dx+\frac{1}{2}\int T_t^2{\mathcal {F}}[\varPhi _t] R_t^2dx-\frac{1}{2}\int T_t\varPhi _tR_t^2dx\\&-\frac{t}{2}\int T_t\left( \partial _t\varPhi _t\right) R_t^2dx- \frac{t}{2}\int (\partial _t T_t)\varPhi _t R_t^2 dx\\&-\frac{t}{2}\int T_t({\mathcal {F}}_t[\varPhi _t])^2 R_t^2 dx+\frac{t}{4} \int ({\mathcal {F}}_t[\varPhi _t])^2R_t^2dx\\&+\frac{t^2}{4} \int (\partial _t {\mathcal {F}}_t[\varPhi _t]) {\mathcal {F}}_t[\varPhi _t] R_t^2dx+\frac{t^2}{8} \int ({\mathcal {F}}_t[\varPhi _t])^3 R_t^2 dx\\&-\frac{t^2}{4}\int {\mathcal {F}}_t\left[ \frac{\delta E}{\delta \rho _t}\right] {\mathcal {F}}_t[\varPhi _t]R_t^2dx+\frac{t}{2}(E(R^2_t)-E(\rho ^*)).\\ \end{aligned} \end{aligned}$$

Because \(E(\rho )\) is convex, we have

$$\begin{aligned} E(R^2_t)-E(\rho ^*)\le -\int {\mathcal {F}}_t\left[ \frac{\delta E}{\delta \rho _t}\right] T_tR_t^2dx. \end{aligned}$$

From Lemma 4, we have

$$\begin{aligned} \begin{aligned} {{\dot{{\mathcal {E}}}}}(t) \le&-\frac{3}{2}\int T_t\varPhi _tR_t^2dx-\frac{t}{2}\int T_t\left( \partial _t\varPhi _t\right) R_t^2dx \\&-\frac{t}{4}\int T_t({\mathcal {F}}_t[\varPhi _t])^2 R_t^2 dx+\frac{3t}{4} \int ({\mathcal {F}}_t[\varPhi _t])^2R_t^2dx\\&+\frac{t^2}{4} \int (\partial _t \varPhi _t) {\mathcal {F}}_t[\varPhi _t] R_t^2dx+\frac{t^2}{8} \int ({\mathcal {F}}_t[\varPhi _t])^3 R_t^2 dx\\&-\frac{t^2}{4}\int \frac{\delta E}{\delta \rho _t}{\mathcal {F}}_t[\varPhi _t]R_t^2dx-\frac{t}{2}\int {\mathcal {F}}_t\left[ \frac{\delta E}{\delta \rho _t}\right] T_tR_t^2dx\\ =&-\frac{3}{2}\int T_t\varPhi _tR_t^2dx-\frac{t}{2}\int T_tR_t^2\left( \partial _t\varPhi _t+\frac{1}{2}({\mathcal {F}}_t[\varPhi _t])^2+\frac{\delta E}{\delta \rho _t}\right) \\&+\frac{3t}{4} \int ({\mathcal {F}}_t[\varPhi _t])^2R_t^2dx+\frac{t^2}{4}\int {\mathcal {F}}_t[\varPhi _t] R_t^2\left( \partial _t\varPhi _t+\frac{1}{2}({\mathcal {F}}_t[\varPhi _t])^2+\frac{\delta E}{\delta \rho _t}\right) dx =0. \end{aligned} \end{aligned}$$

The last equality utilizes the fact that \(\partial _t\varPhi _t+\frac{1}{2}({\mathcal {F}}_t[\varPhi _t])^2+\frac{\delta E}{\delta \rho _t}=-\frac{3}{t}\varPhi _t\) up to a spatially constant term, which integrates to zero against both \(T_tR_t^2\) and \({\mathcal {F}}_t[\varPhi _t]R_t^2\). Hence \({\mathcal {E}}(t)\le {\mathcal {E}}(0)\), which yields \(E(R_t^2)-E(\rho ^*)\le 4{\mathcal {E}}(0)/t^2\).

Discrete-Time Algorithm of AIG Flows

In this section, we introduce the discrete-time algorithms for Kalman-Wasserstein AIG flows and Stein AIG flows. Here \(E(\rho )\) is the KL divergence from \(\rho \) to \(\rho ^*\propto \exp (-f)\).

1.1 Discrete-Time Algorithm of KW-AIG Flows

For the KL divergence, the particle formulation (5) of KW-AIG flows reads

$$\begin{aligned} \left\{ \begin{aligned}&dX_t = C^\lambda (\rho _t)V_t dt,\\&dV_t =-\alpha _tV_t dt-{\mathbb {E}}[V_tV_t^T](X_t-{\mathbb {E}}[X_t])dt-(\nabla f(X_t)+\nabla \log \rho _t(X_t)) dt. \end{aligned}\right. \end{aligned}$$
(54)

Consider a particle system \(\{X_0^i\}_{i=1}^N\). In the k-th iteration, the update rule follows: for \(i=1,2,\dots ,N\),

$$\begin{aligned} \left\{ \begin{aligned}&X_{k+1}^i =X_{k}^i +\sqrt{\tau _k}C^\lambda _k V_{k+1}^i,\\&V_{k+1}^i =\alpha _k V_k^i-\sqrt{\tau _k}\left[ \frac{1}{N}\sum _{j=1}^N(V_k^j)(V_k^j)^T\right] (X_{k}^i-m_k)-\sqrt{\tau _k}(\nabla f(X_k^i)+\xi _k(X_k^i)). \end{aligned}\right. \end{aligned}$$
(55)

Here \(\xi _k\) is an approximation of \(\nabla \log \rho _k\), and we denote

$$\begin{aligned} m_k = \frac{1}{N}\sum _{i=1}^N X_k^i,\quad C^\lambda _k = \frac{1}{N-1}\sum _{i=1}^N (X_k^i-m_k)(X_k^i-m_k)^T+\lambda I. \end{aligned}$$

The choice of \(\alpha _k\) is similar to the discrete-time algorithm of W-AIG flows. If \(E(\rho )\) is \(\beta \)-strongly convex, then \(\alpha _k = \frac{1-\sqrt{\beta \tau _k}}{1+\sqrt{\beta \tau _k}}\); if \(E(\rho )\) is convex or \(\beta \) is unknown, then \(\alpha _k = \frac{k-1}{k+2}\).
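The schedule for \(\alpha _k\) can be written as a small helper. The sketch below is illustrative only (the function name is ours), covering both the strongly convex and the Nesterov-style cases described above.

```python
import numpy as np

def momentum_coeff(k, tau_k, beta=None):
    """Momentum coefficient alpha_k for the discrete-time AIG updates.

    If E is beta-strongly convex:  (1 - sqrt(beta*tau_k)) / (1 + sqrt(beta*tau_k)).
    Otherwise (convex E, or beta unknown): the Nesterov-style (k - 1) / (k + 2).
    """
    if beta is not None:
        s = np.sqrt(beta * tau_k)
        return (1.0 - s) / (1.0 + s)
    return (k - 1.0) / (k + 2.0)
```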

For the adaptive restart technique, the restart criterion reads

$$\begin{aligned} \varphi _k = - \sum _{i=1}^N\left\langle C^\lambda _kV_{k+1}^i, \nabla f(X_k^i)+\xi _k(X_k^i)\right\rangle . \end{aligned}$$
(56)

The overall algorithm is summarized as follows.

[Algorithm figure b: discrete-time algorithm of KW-AIG flows]
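One iteration of (55) with the restart indicator (56) can be sketched in NumPy as follows. This is an illustrative sketch rather than the authors' implementation: the score approximation \(\xi _k\) is passed in as a callable (in practice produced by the BM or MED method), the update order (velocity first, then position) is an assumption, and all function and argument names are ours.

```python
import numpy as np

def kw_aig_step(X, V, grad_f, xi, tau, alpha, lam=1e-3):
    """One KW-AIG particle update, a sketch of (55)-(56).

    X, V : (N, d) arrays of particle positions and velocities.
    grad_f : callable returning the (N, d) gradients of the potential f.
    xi : callable approximating the score grad log rho (an assumption here).
    Returns the updated (X, V) and the restart indicator phi_k.
    """
    N, d = X.shape
    m = X.mean(axis=0)                            # m_k
    Xc = X - m
    C = Xc.T @ Xc / (N - 1) + lam * np.eye(d)     # regularized covariance C_k^lambda
    st = np.sqrt(tau)
    G = grad_f(X) + xi(X)                         # grad f + approximate score
    VVt = V.T @ V / N                             # empirical E[V V^T]
    # velocity update: damping, mean-field E[VV^T](X - m), and force terms
    V_new = alpha * V - st * (Xc @ VVt.T) - st * G
    X_new = X + st * V_new @ C.T                  # position update with C_k^lambda
    phi = -np.sum((V_new @ C.T) * G)              # restart criterion (56)
    return X_new, V_new, phi
```

A typical adaptive restart then resets \(V\) to zero (and the iteration counter) whenever \(\varphi _k\) becomes negative.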

1.2 Discrete-Time Algorithm for S-AIG Flows

For the KL divergence, the particle formulation of S-AIG flows reads

$$\begin{aligned} \left\{ \begin{aligned}&\frac{d}{dt} X_t = \int k(X_t,y) \nabla \varPhi _t(y)\rho _t(y)dy,\\&\frac{d}{dt} V_t =-\alpha _t V_t-\int V_t^T\nabla \varPhi _t (y) \nabla _x k(X_t,y) \rho _t(y)dy-\nabla f(X_t)-\nabla \log \rho _t(X_t). \end{aligned}\right. \end{aligned}$$
(57)

Consider a particle system \(\{X_0^i\}_{i=1}^N\). In the k-th iteration, the update rule follows: for \(i=1,2,\dots ,N\),

$$\begin{aligned} \left\{ \begin{aligned}&X_{k+1}^i = X_k^i+\frac{\sqrt{\tau _k}}{N}\sum _{j=1}^N k(X_k^i,X_k^j) V_{k+1}^j,\\&V_{k+1}^i =\alpha _k V_k^i-\frac{\sqrt{\tau _k}}{N}\sum _{j=1}^N (V_k^i)^TV_k^j\nabla _x k(X_k^i,X_k^j) -\sqrt{\tau _k}(\nabla f(X_k^i)+\xi _k(X_k^i)). \end{aligned}\right. \end{aligned}$$
(58)

Here \(\xi _k\) is an approximation of \(\nabla \log \rho _k\). The choice of \(\alpha _k\) is similar, depending on the convexity of \(E(\rho )\) with respect to the Stein metric.

For the adaptive restart technique, the restart criterion reads

$$\begin{aligned} \varphi _k = - \sum _{i=1}^N\sum _{j=1}^Nk(X_k^j,X_k^i)\left\langle V_{k+1}^j, \nabla f(X_k^i)+\xi _k(X_k^i)\right\rangle . \end{aligned}$$
(59)

The overall algorithm is summarized as follows.

[Algorithm figure c: discrete-time algorithm of S-AIG flows]
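Similarly, one iteration of (58) with the restart indicator (59) can be sketched with a Gaussian (RBF) kernel. This is again a hedged illustration, not the authors' code: the kernel choice, the bandwidth \(h\), and the score approximation \(\xi _k\) are assumptions.

```python
import numpy as np

def rbf_kernel(X, h):
    """RBF kernel matrix K[i, j] = k(X_i, X_j) and gradients grad_x k(X_i, X_j)."""
    diff = X[:, None, :] - X[None, :, :]          # (N, N, d), diff[i, j] = X_i - X_j
    sq = np.sum(diff**2, axis=-1)
    K = np.exp(-sq / (2.0 * h**2))
    gradK = -diff / h**2 * K[:, :, None]          # gradient in the first argument
    return K, gradK

def s_aig_step(X, V, grad_f, xi, tau, alpha, h=1.0):
    """One S-AIG particle update, a sketch of (58)-(59)."""
    N = X.shape[0]
    st = np.sqrt(tau)
    K, gradK = rbf_kernel(X, h)
    G = grad_f(X) + xi(X)                         # grad f + approximate score
    # interaction: (1/N) sum_j <V_i, V_j> grad_x k(X_i, X_j)
    inter = np.einsum('ij,ijd->id', V @ V.T, gradK) / N
    V_new = alpha * V - st * inter - st * G
    X_new = X + st * (K @ V_new) / N              # kernel-averaged velocity step
    phi = -np.sum((K @ V_new) * G)                # restart criterion (59), K symmetric
    return X_new, V_new, phi
```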

Implementation Details in the Numerical Experiments

In this section, we provide extra numerical experiments and elaborate on the implementation details in the numerical experiments.

1.1 Details in Subsection 6.1

We follow the same setting as [17], which is also adopted by [14, 15]. The dataset is split into \(80\%\) for training and \(20\%\) for testing. We use stochastic gradients with a mini-batch size of 100. For MCMC, the number of particles is \(N=1000\); for the other methods, the number of particles is \(N=100\). The BM method is not applied to SVGD in selecting the bandwidth.

The initial step sizes for the compared methods are given in Table 3, selected by grid search over \(1\times 10^i\) with \(i=-3,-4,\dots ,-9\). (For SVGD, we use the initial step size in [17].) The step size of SVGD is adjusted by Adagrad, the same as in [17]. For WNAG and WRes, the step size is given by \(\tau _l = \tau _0/l^{0.9}\) for \(l\ge 1\). The parameters for WNAG and WNes are identical to those in [15] and [14]. For the other methods, the step size is multiplied by 0.9 every 100 iterations. For methods under the Kalman-Wasserstein metric, we require a smaller step size (around \(10^{-8}\)) to make the algorithm converge. For all discrete-time algorithms of AIGs, we apply the restart technique. We record the CPU time for each method in Table 4. The computational cost of the BM method is much higher than that of the MED method, because we need to evaluate the MMD of two particle systems several times when optimizing the subproblem. To reduce this cost, one may update the bandwidth using the BM method only every 10 iterations. On the other hand, even with the MED method for the bandwidth, the computational cost of S-AIG is much higher than that of the other methods. This results from the repeated evaluation of particle interactions in updating \(X_k^i\) and \(V_k^i\).

Table 3 Initial step sizes for compared algorithms in Bayesian logistic regression
Table 4 Average CPU time (s) for algorithms in Bayesian logistic regression

1.2 Details in Subsection 6.2

We follow the Bayesian neural network setting of [38]. The kernel bandwidth is adjusted by the MED method. We list the number of epochs and the batch size for each dataset in Table 5. For each dataset, we use \(90\%\) of the samples as the training set and \(10\%\) as the test set. The step size of SVGD is adjusted by Adagrad. For W-GF and W-AIG, the step size is multiplied by 0.64 every 1/10 of the total epochs. We select the initial step size by grid search over \(\{1,2,5\}\times 10^i\) with \(i=-3,-4,\dots ,-7\) to ensure the best performance of the compared methods. We list the initial step sizes for each dataset in Table 6. For W-AIG, we apply the adaptive restart.

Table 5 Number of epochs and batch size in Bayesian neural network
Table 6 Initial step sizes for compared methods in Bayesian neural network


Cite this article

Wang, Y., Li, W. Accelerated Information Gradient Flow. J Sci Comput 90, 11 (2022). https://doi.org/10.1007/s10915-021-01709-3
