We present a framework of Nesterov's accelerated gradient flows in probability space for designing efficient mean-field Markov chain Monte Carlo algorithms for Bayesian inverse problems. Four examples of information metrics are considered, including the Fisher-Rao metric, the Wasserstein-2 metric, the Kalman-Wasserstein metric and the Stein metric. For both the Fisher-Rao and the Wasserstein-2 metrics, we prove convergence properties of the accelerated gradient flows. In implementations, we propose a sampling-efficient discrete-time algorithm with a restart technique for the Wasserstein-2, Kalman-Wasserstein and Stein accelerated gradient flows. We also formulate a kernel bandwidth selection method, which learns the gradient of the logarithm of the density from Brownian-motion samples. Numerical experiments, including Bayesian logistic regression and Bayesian neural networks, demonstrate the strength of the proposed methods compared with state-of-the-art algorithms.
Liu, C., Zhuo, J., Cheng, P., Zhang, R., Zhu, J.: Understanding and accelerating particle-based variational inference. In: International Conference on Machine Learning, pp. 4082–4092. (2019)
Liu, C., Zhuo, J., Cheng, P., Zhang, R., Zhu, J., Carin, L.: Accelerated first-order methods on the Wasserstein space for Bayesian inference. arXiv preprint arXiv:1807.01750 (2018)
Liu, Q.: Stein variational gradient descent as gradient flow. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems 30, pp. 3115–3123. Curran Associates, Inc., US (2017)
Liu, Q., Wang, D.: Stein variational gradient descent: a general purpose Bayesian inference algorithm. In: Advances in Neural Information Processing Systems, pp. 2378–2386. (2016)
Liu, Y., Shang, F., Cheng, J., Cheng, H., Jiao, L.: Accelerated first-order methods for geodesically convex optimization on Riemannian manifolds. In: Advances in Neural Information Processing Systems, pp. 4868–4877. (2017)
Ma, Y.A., Chatterji, N., Cheng, X., Flammarion, N., Bartlett, P., Jordan, M.I.: Is there an analog of Nesterov acceleration for MCMC? arXiv preprint arXiv:1902.00996 (2019)
Malago, L., Matteucci, M., Pistone, G.: Natural gradient, fitness modelling and model selection: a unifying perspective. In: 2013 IEEE Congress on Evolutionary Computation, pp. 486–493. IEEE (2013)
Martens, J., Grosse, R.: Optimizing neural networks with Kronecker-factored approximate curvature. In: International Conference on Machine Learning, pp. 2408–2417 (2015)
Modin, K.: Geometry of matrix decompositions seen through optimal transport and information geometry. arXiv preprint arXiv:1601.01875 (2016)
Nesterov, Y.: A method of solving a convex programming problem with convergence rate \(O(1/k^2)\). Sov. Math. Dokl. 27(2), 372–376 (1983)
Su, W., Boyd, S., Candès, E.J.: A differential equation for modeling Nesterov's accelerated gradient method: theory and insights. J. Mach. Learn. Res. 17(153), 1–43 (2016)
Taghvaei, A., Mehta, P.G.: Accelerated flow for probability distributions. arXiv preprint arXiv:1901.03317 (2019)
Takatsu, A.: On Wasserstein geometry of the space of Gaussian measures. arXiv preprint arXiv:0801.2250 (2008)
Villani, C.: Topics in Optimal Transportation. American Mathematical Society, Providence (2003)
Wang, D., Tang, Z., Bajaj, C., Liu, Q.: Stein variational gradient descent with matrix-valued kernels. In: Advances in Neural Information Processing Systems, pp. 7834–7844 (2019)
Wang, Y., Jia, Z., Wen, Z.: The search direction correction makes first-order methods faster. arXiv preprint arXiv:1905.06507 (2019)
Wang, Y., Li, W.: Information Newton's flow: second-order optimization method in probability space. arXiv preprint arXiv:2001.04341 (2020)
Wibisono, A., Wilson, A.C., Jordan, M.I.: A variational perspective on accelerated methods in optimization. Proc. Natl. Acad. Sci. 113(47), E7351–E7358 (2016)
Appendix
In this appendix, we provide detailed derivations of the examples and proofs of the propositions. We also design particle implementations of the KW-AIG and S-AIG flows and give implementation details of the numerical experiments.
Euler-Lagrange Equation, Hamiltonian Flows and AIG Flows
In this section, we review and derive the Euler-Lagrange equation, Hamiltonian flows and the Euler-Lagrange formulation of AIG flows in probability space.
1.1 Derivation of the Euler-Lagrange Equation
In this subsection, we derive the Euler-Lagrange equation in probability space. For a given metric \(g_\rho \) in probability space, we can define a Lagrangian by
Let \(h_t\in {\mathcal {F}}(\varOmega )\) be a smooth perturbation function that satisfies \(\int h_t dx=0\) for \(t\in \left[ 0,T\right] \) and \(h_t|_{t=0}=h_t|_{t=T}\equiv 0\). Denote \(\rho _t^\epsilon = \rho _t+\epsilon h_t\). Note that we have the Taylor expansion
Because \(\int h_t dx=0\), the Euler-Lagrange equation holds with a spatially constant function C(t). \(\square \)
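For reference, in this notation the resulting Euler-Lagrange equation takes the standard constrained form

$$\begin{aligned} \partial _t\frac{\delta L}{\delta (\partial _t\rho _t)}(t,x)-\frac{\delta L}{\delta \rho _t}(t,x)=C(t), \end{aligned}$$

where the spatially constant right-hand side arises because the admissible perturbations \(h_t\) are restricted to have zero integral.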
1.2 Derivation of Hamiltonian Flow
In this subsection, we derive the Hamiltonian flow in the probability space. Denote \(\varPhi _t = \delta L/\delta (\partial _t \rho _t)=G(\rho _t)\partial _t\rho _t\). Then, the Euler-Lagrange equation can be formulated as a system of \((\rho _t,\varPhi _t)\), i.e.,
First, we give a useful identity. Given a metric tensor \(G(\rho ):T_\rho {\mathcal {P}}(\varOmega )\rightarrow T_\rho ^*{\mathcal {P}}(\varOmega )\), we have
Let \({{\tilde{\rho }}}_t = \rho _t+\epsilon h\), where \(h\in T_{\rho _t}{\mathcal {P}}(\varOmega )\). For all \(\sigma \in T_{\rho _t}{\mathcal {P}}(\varOmega )\), it follows
$$\begin{aligned} G(\rho _t+\epsilon h )^{-1}G(\rho _t+\epsilon h ) \sigma = \sigma . \end{aligned}$$
The first-order derivative of the left-hand side w.r.t. \(\epsilon \) must vanish, i.e.,
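$$\begin{aligned} \left( \frac{d}{d\epsilon }G(\rho _t+\epsilon h)^{-1}\Big |_{\epsilon =0}\right) G(\rho _t)\sigma +G(\rho _t)^{-1}\left( \frac{d}{d\epsilon }G(\rho _t+\epsilon h)\Big |_{\epsilon =0}\right) \sigma =0, \end{aligned}$$

which rearranges to the derivative-of-the-inverse identity

$$\begin{aligned} \frac{d}{d\epsilon }G(\rho _t+\epsilon h)^{-1}\Big |_{\epsilon =0}=-G(\rho _t)^{-1}\left( \frac{d}{d\epsilon }G(\rho _t+\epsilon h)\Big |_{\epsilon =0}\right) G(\rho _t)^{-1}. \end{aligned}$$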
Combining this equation with \(\partial _t\rho _t=G(\rho _t)^{-1}\varPhi _t\) recovers the Hamiltonian flow. In short, the Euler-Lagrange equation is formulated in the primal coordinates \((\rho _t,\partial _t\rho _t)\), while the Hamiltonian flow is formulated in the dual coordinates \((\rho _t,\varPhi _t)\). Similar interpretations can be found in [8].
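For reference, if the Lagrangian has the form \(L(\rho _t,\partial _t\rho _t)=\frac{1}{2}\int \partial _t\rho _t\, G(\rho _t)\partial _t\rho _t\, dx-E(\rho _t)\) (as in the Gaussian case below), the corresponding Hamiltonian is \({\mathcal {H}}(\rho ,\varPhi )=\frac{1}{2}\int \varPhi \, G(\rho )^{-1}\varPhi \, dx+E(\rho )\), which is consistent with the Gaussian-family Hamiltonian \(H_t=2{\text {tr}}(S_t\varSigma _t S_t)+E(\varSigma _t)\) used later. The dual-coordinate system then takes the canonical form

$$\begin{aligned} \partial _t\rho _t=\frac{\delta {\mathcal {H}}}{\delta \varPhi _t}=G(\rho _t)^{-1}\varPhi _t,\qquad \partial _t\varPhi _t=-\frac{\delta {\mathcal {H}}}{\delta \rho _t}. \end{aligned}$$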
1.3 The Euler-Lagrange Formulation of AIG Flows
We can formulate the AIG flow as a second-order equation of \(\rho _t\),
We next present several examples of Hamiltonian flows w.r.t. different metrics. The derivations simply follow from the definition of the given information metric and the formulations given in Appendix A.2.
Example 16
(Fisher-Rao Hamiltonian flow) The Fisher-Rao Hamiltonian flow follows
Wasserstein AIG Flows in Gaussian Families

In this section, we first introduce the Wasserstein metric, gradient flows and Hamiltonian flows in Gaussian families. Then, we validate the existence of solutions to (W-AIG) in Gaussian families. Denote by \({\mathcal {N}}_n^0\) the set of multivariate Gaussian densities with zero mean. Namely, if \(\rho _0, \rho ^*\in {\mathcal {N}}_n^0\), then we show that (W-AIG) has a solution \((\rho _t,\varPhi _t)\) with \(\rho _t\in {\mathcal {N}}_n^0\).
Let \({\mathbb {P}}^n\) and \({\mathbb {S}}^n\) denote the sets of \(n\times n\) symmetric positive definite matrices and symmetric matrices, respectively. Each \(\rho \in {\mathcal {N}}_n^0\) is uniquely determined by its covariance matrix \(\varSigma \in {\mathbb {P}}^n\). The Wasserstein metric \(G^W(\rho )\) on \({\mathcal {P}}({\mathbb {R}}^n)\) induces the Wasserstein metric \(G^W(\varSigma )\) on \({\mathbb {P}}^n\), which is also known as the Bures metric, see [22, 24, 36]. For \(\varSigma \in {\mathbb {P}}^n\), the tangent and cotangent spaces satisfy \(T_{\varSigma }{\mathbb {P}}^n\simeq T^*_{\varSigma }{\mathbb {P}}^n\simeq {\mathbb {S}}^n\).
Definition 3
(Wasserstein metric in Gaussian families) For \(\varSigma \in {\mathbb {P}}^n\), the metric tensor \(G^W(\varSigma ):{\mathbb {S}}^n\rightarrow {\mathbb {S}}^n\) is defined by
The derivation of the gradient flow follows directly from the definition of the Wasserstein metric in Gaussian families.
We then derive the Hamiltonian flow as follows. For \(A\in {\mathbb {S}}^n\), we define the linear operator \(M_A:{\mathbb {S}}^n\rightarrow {\mathbb {S}}^n\) by
It is easy to verify that \(M_A^{-1}\) is well-defined if \(A\in {\mathbb {P}}^n\). For a flow \(\varSigma _t\in {\mathbb {P}}^n\), \(t\ge 0\), we define the Lagrangian \(L(\varSigma _t,{\dot{\varSigma }}_t)=\frac{1}{2}g_{\varSigma _t}({{\dot{\varSigma }}}_t,{\dot{\varSigma }}_t)-E(\varSigma _t).\) The corresponding Euler-Lagrange equation reads
This leads to \(\frac{d L}{d {\dot{\varSigma }}_t}=\frac{1}{2}M_{\varSigma _t}^{-1}{{\dot{\varSigma }}}_t=S_t.\) For simplicity, we denote \(g=g_{\varSigma _t}({{\dot{\varSigma }}}_t,{{\dot{\varSigma }}}_t)\). First, we show that
Because \(S_t=\frac{1}{2}M_{\varSigma _t}^{-1}{{\dot{\varSigma }}}_t\), for a given \({{\dot{\varSigma }}}_t\) we can view \(S_t\) as a continuous function of \(\varSigma _t\). For any \(A\in {\mathbb {S}}^n\), define \(l_A={\text {tr}}((\varSigma _t S_t+S_t \varSigma _t)A)\).
Here we view \(\partial S_t/\partial \varSigma _t\) as a linear operator on \({\mathbb {S}}^n\). Let \(B=A\varSigma _t+\varSigma _t A\); then \(A=M_{\varSigma _t}^{-1}B\), and \(\frac{\partial S_t}{\partial \varSigma _t} B+M_{S_t}M_{\varSigma _t}^{-1}B=0\) holds for all \(B\in {\mathbb {S}}^n\). Therefore, we have \(\frac{\partial S_t}{\partial \varSigma _t}=-M_{S_t}M_{\varSigma _t}^{-1}\). Hence,
By adding a damping term \(\alpha _tS_t\), we derive (W-AIG-G), i.e., the Wasserstein AIG flow in Gaussian families. We present the proof of Proposition 2 as follows. We first show that \(\varSigma _t\) stays in \({\mathbb {P}}^n\). Suppose that \(\varSigma _t\in {\mathbb {P}}^n\) for \(0\le t\le T\). Define \(H_t=H(\varSigma _t,S_t)=2{\text {tr}}(S_t\varSigma _t S_t)+E(\varSigma _t)\). We observe that (W-AIG-G) is equivalent to
This means that as long as \(\varSigma _t\in {\mathbb {P}}^n\), the smallest eigenvalue of \(\varSigma _t\) has a positive lower bound. Suppose that there exists \(T>0\) such that \(\varSigma _T\notin {\mathbb {P}}^n\). Because \(\varSigma _t\) is continuous with respect to t, there exists \(T_1<T\) such that \(\varSigma _t\in {\mathbb {P}}^n\) for \(0\le t\le T_1\) and \(\lambda _{T_1}<\exp \left( -2H(0)/n-1\right) \), which violates (31). This contradiction shows that \(\varSigma _t\in {\mathbb {P}}^n\) for all \(t\ge 0\).
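To make the dynamics concrete, below is a minimal forward-Euler sketch of (W-AIG-G). It assumes, consistent with the Hamiltonian \(H_t=2{\text {tr}}(S_t\varSigma _t S_t)+E(\varSigma _t)\) above, that the damped flow reads \({\dot{\varSigma }}_t=2(\varSigma _tS_t+S_t\varSigma _t)\) and \({\dot{S}}_t=-\alpha _tS_t-2S_t^2-\partial E/\partial \varSigma _t\); the energy in `grad_E` (the KL divergence to a zero-mean target Gaussian) and all parameter values are illustrative assumptions, not the paper's exact setup.

```python
# Forward-Euler sketch of (W-AIG-G) in Gaussian families (assumed form):
#   dSigma/dt = 2 (Sigma @ S + S @ Sigma)          # = 2 M_Sigma(S)
#   dS/dt     = -alpha * S - 2 S @ S - dE/dSigma
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def apply_M_inv(A, B):
    """Apply M_A^{-1}, i.e. solve the Lyapunov equation A X + X A = B.

    For instance, S_t = 0.5 * apply_M_inv(Sigma, dSigma) recovers the dual
    variable from a given velocity, matching S_t = (1/2) M_Sigma^{-1} dSigma.
    """
    return solve_continuous_lyapunov(A, B)

def grad_E(Sigma, Sigma_star):
    # Illustrative energy: E(Sigma) = KL(N(0, Sigma) || N(0, Sigma_star)),
    # whose derivative w.r.t. Sigma is (1/2)(Sigma_star^{-1} - Sigma^{-1}).
    return 0.5 * (np.linalg.inv(Sigma_star) - np.linalg.inv(Sigma))

def w_aig_gaussian(Sigma0, Sigma_star, alpha=2.0, dt=1e-3, steps=20000):
    Sigma = Sigma0.copy()
    S = np.zeros_like(Sigma0)       # S_t|_{t=0} = 0, as in the proof above
    for _ in range(steps):
        dSigma = 2.0 * (Sigma @ S + S @ Sigma)
        dS = -alpha * S - 2.0 * (S @ S) - grad_E(Sigma, Sigma_star)
        Sigma, S = Sigma + dt * dSigma, S + dt * dS
    return Sigma

if __name__ == "__main__":
    Sigma0 = np.eye(2)
    Sigma_star = np.array([[2.0, 0.5], [0.5, 1.0]])
    print(w_aig_gaussian(Sigma0, Sigma_star))  # should approach Sigma_star
```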
We then reveal the relationship between (W-AIG) in \({\mathcal {P}}({\mathbb {R}}^n)\) and \({\mathbb {P}}^n\). We observe that
Note that \(\rho ^*\) is the Gaussian density with the covariance matrix \(\varSigma ^*\). Because \(\dot{C}(t) = \frac{1}{2}\log \det (\varSigma _tW^*)-1\), we can compute
Therefore, the second equation of (W-AIG) holds. Because \(\varSigma _t|_{t=0}=\varSigma _0\), \(S_t|_{t=0}=0\) and \(C(0)=0\), we have \(\rho _t|_{t=0}=\rho _0\) and \(\varPhi _t|_{t=0}=0\). This completes the proof.
Proof of Convergence Rate Under Wasserstein Metric
In this section, we briefly review the Riemannian structure of probability space and present proofs of the propositions in Sect. 4 under the Wasserstein metric.
1.1 A Brief Review on the Geometric Properties of the Probability Space
Suppose that we have a metric \(g_\rho \) in probability space \({\mathcal {P}}(\varOmega )\). Given two probability densities \(\rho _0,\rho _1\in {\mathcal {P}}(\varOmega )\), we define the distance as follows
The minimizer \({{\hat{\rho }}}_s\) of the above problem is defined as the geodesic curve connecting \(\rho _0\) and \(\rho _1\). An exponential map at \(\rho _0\in {\mathcal {P}}(\varOmega )\) is a mapping from the tangent space \(T_{\rho _0}{\mathcal {P}}(\varOmega )\) to \({\mathcal {P}}(\varOmega )\). Namely, \(\sigma \in T_{\rho _0}{\mathcal {P}}(\varOmega )\) is mapped to a point \(\rho _1\in {\mathcal {P}}(\varOmega )\) such that there exists a geodesic curve \({{\hat{\rho }}}_s\) satisfying \({{\hat{\rho }}}_s|_{s=0}=\rho _0,\partial _s\hat{\rho }_s|_{s=0}=\sigma ,\) and \({{\hat{\rho }}}_s|_{s=1}=\rho _1\).
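As a concrete example, under the Wasserstein metric the exponential map admits an explicit form (stated informally, for sufficiently regular and small \(\varPhi \), so that \({\text {Id}}+s\nabla \varPhi \) remains the gradient of a convex function for \(s\in [0,1]\)): if \(\sigma =-\nabla \cdot (\rho _0\nabla \varPhi )\in T_{\rho _0}{\mathcal {P}}(\varOmega )\), then

$$\begin{aligned} {\text {Exp}}_{\rho _0}(\sigma )=({\text {Id}}+\nabla \varPhi )\#\rho _0, \end{aligned}$$

since \({{\hat{\rho }}}_s=({\text {Id}}+s\nabla \varPhi )\#\rho _0\) is a Wasserstein geodesic with \(\partial _s{{\hat{\rho }}}_s|_{s=0}=\sigma \).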
1.2 The Inverse of Exponential Map
In this subsection, we characterize the inverse of the exponential map in the probability space with the Wasserstein metric.
Proposition 7
Denote the geodesic curve \(\gamma (s)\) that connects \(\rho _t\) and \(\rho ^*\) by \(\gamma (s)=(sT_t+(1-s){\text {Id}})\#\rho _t,\,s\in [0,1]\). Here \({\text {Id}}\) is the identity mapping from \({\mathbb {R}}^n\) to itself. Then, \(\partial _s\gamma (s)|_{s=0}\) corresponds to a tangent vector \(-\nabla \cdot (\rho _t(x) (T_t(x)-x))\in T_{\rho _t}{\mathcal {P}}(\varOmega )\).
For simplicity, we denote \(T_t^{s}=(sT_t+(1-s){\text {Id}})^{-1},s\in \left[ 0,1\right] \). Based on the theory of optimal transport [37], we can write the explicit formula of the geodesic curve \(\gamma (s)\) by
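Namely, by the change-of-variables formula for the pushforward (the determinant below is positive because \(\nabla T_t\) is symmetric positive definite, so \(s\nabla T_t+(1-s)I\) is invertible), the density along the geodesic can be written as

$$\begin{aligned} \gamma (s)(x)=\rho _t(T_t^{s}(x))\det \left( \nabla T_t^{s}(x)\right) ,\qquad s\in [0,1]. \end{aligned}$$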
The main goal of this subsection is to prove that the Lyapunov function \({\mathcal {E}}(t)\) is non-increasing.
Preparations We first give a better characterization of the optimal transport plan \(T_t\). We can write \(T_t=\nabla \varPsi _t\), where \(\varPsi _t\) is a strictly convex function, see [37]. This indicates that \(\nabla T_t\) is symmetric. We then introduce the following proposition.
Proposition 8
Suppose that \(E(\rho )\) satisfies Hess(\(\beta \)) for \(\beta \ge 0\). Let \(T_t(x)\) be the optimal transport plan from \(\rho _t\) to \(\rho ^*\), then
Because \((T_t)^{-1}\#{\rho ^*}={\rho _t}\), let \(u_t=\partial _t(T_t)^{-1}\circ T_t\) and \(X_t=(T_t)^{-1}X_0\), where \(X_0\sim \rho ^*\). This yields \(\frac{d}{dt}X_t=u_t(X_t)\), and the distribution of \(X_t\) follows \(\rho _t\). By the continuity equation, \(\rho _t\) satisfies
We first notice that \(u_t-\nabla \varPhi _t\) is divergence-free with respect to \(\rho _t\). From \(-\nabla T_t u_t = \partial _tT_t = \nabla \partial _t\varPsi _t\), we observe that \(-\nabla T_t u_t\) is the gradient of \(\partial _t\varPsi _t\). Therefore,
Based on our previous characterization on the optimal transport plan \(T_t\), \(\nabla T_t = \nabla ^2\varPsi _t\) is symmetric positive definite. This yields that
The last inequality utilizes the facts that \(\nabla T_t\) is positive definite and \(\rho _t\) is non-negative. Then, we prove the equality in Lemma 3. Note that \(\nabla T_t(x)(T_t(x)-x) = \frac{1}{2}\nabla \left( \Vert T_t(x)-x\Vert ^2+2\varPsi _t(x)-\Vert x\Vert ^2\right) \) is a gradient. Similarly, it follows
where the last equality uses (W-AIG) with \(\alpha _t=2\sqrt{\beta }\). Substituting (37) and (38) into the expression of \({{\dot{{\mathcal {E}}}}}(t)e^{-\sqrt{\beta }t}\) yields
Utilizing inequality (44) and substituting the terms involving \(\partial _t T_t\) and \(\nabla \cdot (\rho _t\nabla \varPhi _t)\) in (43) with the expressions in (34)–(35) and (40)–(41), we obtain
Here the target distribution satisfies \(\rho _\infty (x)=\rho ^*(x)\propto \exp (-f(x))\). Suppose that we take \(\alpha _t = \log p-\log t\), \(\beta _t = p\log t +\log C\) and \(\gamma _t = p\log t\), and we specify \(p=2\) and \(C=1/4\). Then the accelerated flow (46) recovers the particle formulation of W-AIG flows if we replace \(Y_t\) by \(2t^{-3}V_t\). The Lyapunov function in [35] follows
The last equality is based on the fact that \(V_t = \nabla \varPhi _t(X_t)\) and \(T_t=T_{\rho _t}^{\rho ^*}\) is the optimal transport plan from \(\rho _t\) to \(\rho ^*\). This indicates that the Lyapunov function in [35] is identical to ours. The technical assumption in [35] follows
In the 1-dimensional case, \(\nabla \cdot \left( \rho _t(u_t-\nabla \varPhi _t)\right) =0\) implies that \(\rho _t(u_t-\nabla \varPhi _t)=0\). For \(\rho _t(x)>0\), we have \(u_t(x)-\nabla \varPhi _t(x) = 0\), so the technical assumption holds. In general, although \(u_t = \partial _t(T_t)^{-1}\circ T_t\) satisfies \(\nabla \cdot \left( \rho _t(u_t-\nabla \varPhi _t)\right) =0\), this does not necessarily imply that \(u_t=\nabla \varPhi _t\). Hence, \({\mathbb {E}}\left[ \left( X_t+e^{-\gamma _t}Y_t-T_{\rho _t}^{\rho _\infty }(X_t)\right) \cdot \frac{d}{dt}T_{\rho _t}^{\rho _\infty }(X_t)\right] =0\) does not necessarily hold except in the 1-dimensional case.
Proof of Convergence Rate Under Fisher-Rao Metric
In this section, we present proofs of the propositions in Sect. 4 under the Fisher-Rao metric.
1.1 Geodesic Curve Under the Fisher-Rao Metric
We first investigate the explicit solution of the geodesic curve under the Fisher-Rao metric in probability space. The geodesic curve satisfies
We observe that \(\frac{1}{2}{\mathbb {E}}_{R_t^2}[\varPhi _t^2]-\frac{1}{2}{\mathbb {E}}_{R_t^2}[\varPhi _t]^2={\mathcal {H}}(\rho _t, \varPhi _t)\) is the Hamiltonian, which is invariant along the geodesic curve. Denote
Suppose that \(\rho _0,\rho _1>0\) and \(\rho _0\ne \rho _1\). Then, there exists a geodesic curve \(\rho _t\) with \(\rho _t|_{t=0}=\rho _0\) and \(\rho _t|_{t=1}=\rho _1\).
Proof
We denote \(R_0(x)=\sqrt{\rho _0(x)}\) and \(R_1(x)=\sqrt{\rho _1(x)}\). We only need to solve for \(A(x)\) and \(H>0\) such that
We note that the manifold \(({\mathcal {P}}^+(\varOmega ), {\mathcal {G}}^{FR}(\rho ))\) is homeomorphic to the manifold \((S^+(\varOmega ), {\mathcal {G}}^{E}(R))\), where \(S^+(\varOmega )=\{R\in {\mathcal {F}}(\varOmega ):R>0,\int R^2 dx =1\}\). Here \((S^+(\varOmega ), {\mathcal {G}}^{E}(R))\) is a submanifold of \({\mathbb {L}}^2(\varOmega )\) equipped with the standard Euclidean metric.
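Under this identification \(R=\sqrt{\rho }\), the geodesic corresponds to the great-circle (spherical linear) interpolation on the unit sphere of \({\mathbb {L}}^2(\varOmega )\); we record it here for concreteness: with \(\cos \theta =\int R_0R_1\, dx\in (0,1)\),

$$\begin{aligned} R_t=\frac{\sin ((1-t)\theta )}{\sin \theta }R_0+\frac{\sin (t\theta )}{\sin \theta }R_1,\qquad \rho _t=R_t^2. \end{aligned}$$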
Then, we prove the convergence results for \(\beta \)-strongly convex \(E(\rho )\). Here we take \(\alpha _t=2\sqrt{\beta }\). Consider the Lyapunov function
The last equality utilizes the fact that \(\partial _t\varPhi _t+\frac{1}{2}({\mathcal {F}}_t[\varPhi _t])^2+\frac{\delta E}{\delta \rho _t}=-\frac{3}{t}\varPhi _t\).
Discrete-Time Algorithm of AIG Flows
In this section, we introduce the discrete-time algorithms for Kalman-Wasserstein AIG flows and Stein AIG flows. Here \(E(\rho )\) is the KL divergence from \(\rho \) to \(\rho ^*\propto \exp (-f)\).
1.1 Discrete-Time Algorithm of KW-AIG Flows
For KL divergence, the particle formulation (5) of KW-AIG flows writes
The choice of \(\alpha _k\) is similar to the discrete-time algorithm of W-AIG flows. If \(E(\rho )\) is \(\beta \)-strongly convex, then \(\alpha _k = \frac{1-\sqrt{\beta \tau _k}}{1+\sqrt{\beta \tau _k}}\); if \(E(\rho )\) is convex or \(\beta \) is unknown, then \(\alpha _k = \frac{k-1}{k+2}\).
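As a concrete illustration, the rule above for \(\alpha _k\) can be implemented as follows (a minimal sketch; the function name and interface are ours):

```python
import math

def momentum_coefficient(k, tau_k, beta=None):
    """Momentum coefficient alpha_k for the discrete-time AIG algorithms.

    If E(rho) is beta-strongly convex with known beta, use the
    constant-damping choice; if E(rho) is merely convex or beta is
    unknown, fall back to the Nesterov-style schedule (k - 1)/(k + 2).
    """
    if beta is not None:
        r = math.sqrt(beta * tau_k)
        return (1.0 - r) / (1.0 + r)
    return (k - 1) / (k + 2)
```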
For the adaptive restart technique, the restart criterion follows
1.2 Discrete-Time Algorithm of S-AIG Flows

Here \(\xi _k\) is an approximation of \(\nabla \log \rho _k\). The choice of \(\alpha _k\) is similar, depending on the convexity of \(E(\rho )\) w.r.t. the Stein metric.
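One simple way to form such an approximation \(\xi _k\) is via a Gaussian kernel density estimate over the particles (a minimal sketch with a fixed bandwidth \(h\); the paper's MED and BM bandwidth selection rules would replace the fixed \(h\)):

```python
import numpy as np

def score_estimate(X, h):
    """Approximate xi_k(x_i) = grad log rho_k(x_i) from particles X of shape (N, d).

    Uses the kernel density estimate rho_hat(x) = mean_j K_h(x - x_j) with a
    Gaussian kernel, whose log-gradient at particle x_i is
        sum_j K_ij (x_j - x_i) / (h^2 sum_j K_ij).
    """
    diff = X[None, :, :] - X[:, None, :]        # (N, N, d): x_j - x_i
    sq_dist = np.sum(diff ** 2, axis=-1)        # (N, N) squared distances
    K = np.exp(-0.5 * sq_dist / h ** 2)         # Gaussian kernel weights
    num = np.einsum('ij,ijd->id', K, diff)      # sum_j K_ij (x_j - x_i)
    return num / (h ** 2 * np.sum(K, axis=1, keepdims=True))
```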
For the adaptive restart technique, the restart criterion follows
Implementation Details in the Numerical Experiments
In this section, we provide extra numerical experiments and elaborate on the implementation details of the numerical experiments.
1.1 Details in Subsection 6.1
We follow the same setting as [17], which is also adopted by [14, 15]. The dataset is split into \(80\%\) for training and \(20\%\) for testing. We use stochastic gradients with a mini-batch size of 100. For MCMC, the number of particles is \(N=1000\); for the other methods, the number of particles is \(N=100\). The BM method is not applied to SVGD in selecting the bandwidth.
The initial step sizes for the compared methods are given in Table 3; they are selected by grid search over \(1\times 10^i\) with \(i=-3,-4,\dots ,-9\). (For SVGD, we use the initial step size in [17].) The step size of SVGD is adjusted by Adagrad, the same as in [17]. For WNAG and WRes, the step size is given by \(\tau _l = \tau _0/l^{0.9}\) for \(l\ge 1\). The parameters for WNAG and WNes are identical to those in [15] and [14]. For the other methods, the step size is multiplied by 0.9 every 100 iterations. For methods under the Kalman-Wasserstein metric, we require a smaller step size (around \(1\times 10^{-8}\)) to make the algorithm converge. For all discrete-time algorithms of AIG flows, we apply the restart technique. We record the CPU time for each method in Table 4. The computational cost of the BM method is much higher than that of the MED method, because we need to evaluate the MMD between two particle systems several times when optimizing the subproblem. To deal with this cost, we may update the bandwidth using the BM method only every 10 iterations. On the other hand, when the bandwidth is selected by the MED method, the computational cost of S-AIG is much higher than that of the other methods. This results from the multiple evaluations of particle interactions required to update \(X_k^i\) and \(V_k^i\).
1.2 Details in Subsection 6.2
We follow the Bayesian neural network setting of [38]. The kernel bandwidth is adjusted by the MED method. We list the number of epochs and the batch size for each dataset in Table 5. For each dataset, we use \(90\%\) of the samples as the training set and \(10\%\) as the test set. The step size of SVGD is adjusted by Adagrad. For W-GF and W-AIG, the step size is multiplied by 0.64 every 1/10 of the total epochs. We select the initial step size by grid search over \(\{1,2,5\}\times 10^i\) with \(i=-3,-4,\dots ,-7\) to ensure the best performance of the compared methods. We list the initial step sizes for each dataset in Table 6. For W-AIG, we apply the adaptive restart technique.