Abstract
We introduce a new variational inference (VI) framework, called energetic variational inference (EVI). It minimizes the VI objective functional based on a prescribed energy-dissipation law. Using the EVI framework, we can derive many existing particle-based variational inference (ParVI) methods, including the popular Stein variational gradient descent (SVGD). More importantly, many new ParVI schemes can be created under this framework. For illustration, we propose a new particle-based EVI scheme, which performs the particle-based approximation of the density first and then uses the approximated density in the variational procedure, or “Approximation-then-Variation” for short. Thanks to this order of approximation and variation, the new scheme can maintain the variational structure at the particle level, and can significantly decrease the KL-divergence in each iteration. Numerical experiments show that the proposed method outperforms some existing ParVI methods in terms of fidelity to the target distribution.
References
Ali, S.M., Silvey, S.D.: A general class of coefficients of divergence of one distribution from another. J. R. Stat. Soc. Ser. B 28(1), 131–142 (1966)
Ambrosio, L., Lisini, S., Savaré, G.: Stability of flows associated to gradient vector fields and convergence of iterated transport maps. Manuscr. Math. 121(1), 1–50 (2006)
Arbel, M., Korba, A., Salim, A., Gretton, A.: Maximum mean discrepancy gradient flow. In: Advances in Neural Information Processing Systems, pp. 6484–6494 (2019)
Barzilai, J., Borwein, J.M.: Two-point step size gradient methods. IMA J. Numer. Anal. 8(1), 141–148 (1988)
Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, New York (2006)
Blei, D.M., Kucukelbir, A., McAuliffe, J.D.: Variational inference: a review for statisticians. J. Am. Stat. Assoc. 112(518), 859–877 (2017)
Carrillo, J.A., Lisini, S.: On the asymptotic behavior of the gradient flow of a polyconvex functional. Nonlinear Partial Differ. Equ. Hyperbolic Wave Phenom. 526, 37–51 (2010)
Carrillo, J.A., Düring, B., Matthes, D., McCormick, D.S.: A Lagrangian scheme for the solution of nonlinear diffusion equations using moving simplex meshes. J. Sci. Comput. 75(3), 1463–1499 (2018)
Carrillo, J.A., Craig, K., Patacchini, F.S.: A blob method for diffusion. Calc. Var. Partial. Differ. Equ. 58(2), 53 (2019)
Casella, G., George, E.I.: Explaining the Gibbs sampler. Am. Stat. 46(3), 167–174 (1992)
Chen, C., Zhang, R., Wang, W., Li, B., Chen, L.: A unified particle-optimization framework for scalable Bayesian sampling (2018). arXiv preprint arXiv:1805.11659
Chen, P., Wu, K., Chen, J., O’Leary-Roseberry, T., Ghattas, O.: Projected stein variational Newton: a fast and scalable Bayesian inference method in high dimensions (2019). arXiv preprint arXiv:1901.08659
Dai, B., He, N., Dai, H., Song, L.: Provable Bayesian inference via particle mirror descent. In: Artificial Intelligence and Statistics, pp. 985–994 (2016)
Degond, P., Mustieles, F.J.: A deterministic approximation of diffusion equations using particles. SIAM J. Sci. Comput. 11(2), 293–310 (1990)
Detommaso, G., Cui, T., Marzouk, Y., Spantini, A., Scheichl, R.: A Stein variational Newton method. In: Advances in Neural Information Processing Systems, pp. 9169–9179 (2018)
Du, Q., Feng, X.: The phase field method for geometric moving interfaces and their numerical approximations (2019). arXiv preprint arXiv:1902.04924
Duane, S., Kennedy, A.D., Pendleton, B.J., Roweth, D.: Hybrid Monte Carlo. Phys. Lett. B 195(2), 216–222 (1987)
Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011)
El Moselhy, T.A., Marzouk, Y.M.: Bayesian inference with optimal maps. J. Comput. Phys. 231(23), 7815–7850 (2012)
Evans, L.C., Savin, O., Gangbo, W.: Diffeomorphisms and nonlinear heat flows. SIAM J. Math. Anal. 37(3), 737–751 (2005)
Francois, D., Wertz, V., Verleysen, M.: About the locality of kernels in high-dimensional spaces. In: International Symposium on Applied Stochastic Models and Data Analysis, pp. 238–245 (2005)
Frogner, C., Poggio, T.: Approximate inference with Wasserstein gradient flows (2018). arXiv preprint arXiv:1806.04542
Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A., Rubin, D.B.: Bayesian Data Analysis. Chapman and Hall/CRC, Boca Raton (2013)
Geman, S., Geman, D.: Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 6, 721–741 (1984)
Gershman, S.J., Hoffman, M.D., Blei, D.M.: Nonparametric variational inference. In: Proceedings of the 29th International Conference on Machine Learning, pp. 235–242 (2012)
Giga, M.H., Kirshtein, A., Liu, C.: Variational modeling and complex fluids. In: Handbook of Mathematical Analysis in Mechanics of Viscous Fluids, pp. 1–41. Springer (2017)
Gonzalez, O., Stuart, A.M.: A First Course in Continuum Mechanics. Cambridge University Press, Cambridge (2008)
Haario, H., Saksman, E., Tamminen, J.: Adaptive proposal distribution for random walk Metropolis algorithm. Comput. Stat. 14(3), 375–396 (1999)
Hastings, W.K.: Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57(1), 97–109 (1970)
Hohenberg, P.C., Halperin, B.I.: Theory of dynamic critical phenomena. Rev. Mod. Phys. 49(3), 435 (1977)
Iserles, A.: A First Course in the Numerical Analysis of Differential Equations, Cambridge Texts in Applied Mathematics, No. 44. Cambridge University Press, New York (2009)
Jordan, R., Kinderlehrer, D., Otto, F.: The variational formulation of the Fokker–Planck equation. SIAM J. Math. Anal. 29(1), 1–17 (1998)
Jordan, M.I., Ghahramani, Z., Jaakkola, T.S., Saul, L.K.: An introduction to variational methods for graphical models. Mach. Learn. 37(2), 183–233 (1999)
Kingma, D.P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., Welling, M.: Improved variational inference with inverse autoregressive flow. In: Advances in Neural Information Processing Systems, pp. 4743–4751 (2016)
Lacombe, G., Mas-Gallic, S.: Presentation and analysis of a diffusion-velocity method. In: ESAIM: Proceedings, vol. 7, pp. 225–233. EDP Sciences (1999)
Li, L., Liu, J.G., Liu, Z., Lu, J.: A stochastic version of Stein variational gradient descent for efficient sampling (2019). arXiv preprint arXiv:1902.03394
Liu, C.: An introduction of elastic complex fluids: an energetic variational approach. In: Multi-Scale Phenomena in Complex Fluids: Modeling, Analysis and Numerical Simulation, pp. 286–337. World Scientific (2009)
Liu, Q.: Stein variational gradient descent as gradient flow. In: Advances in Neural Information Processing Systems, pp. 3115–3123 (2017)
Liu, Q., Wang, D.: Stein variational gradient descent: a general purpose Bayesian inference algorithm. In: Advances in Neural Information Processing Systems, pp. 2378–2386 (2016)
Liu, C., Wang, Y.: On Lagrangian schemes for porous medium type generalized diffusion equations: a discrete energetic variational approach. J. Comput. Phys. 417, 109566 (2020a)
Liu, C., Wang, Y.: A variational Lagrangian scheme for a phase field model: a discrete energetic variational approach. SIAM J. Sci. Comput. 42(6), B1541–B1569 (2020b)
Liu, C., Zhu, J.: Riemannian Stein variational gradient descent for Bayesian inference. In: 32nd AAAI Conference on Artificial Intelligence (2018)
Liu, C., Zhuo, J., Cheng, P., Zhang, R., Zhu, J.: Understanding and accelerating particle-based variational inference. In: International Conference on Machine Learning, pp. 4082–4092 (2019)
Lu, J., Lu, Y., Nolen, J.: Scaling limit of the Stein variational gradient descent: the mean field regime. SIAM J. Math. Anal. 51(2), 648–671 (2019)
MacKay, D.J.C.: Information Theory, Inference and Learning Algorithms. Cambridge University Press, Cambridge (2003)
Matthes, D., Plazotta, S.: A variational formulation of the BDF2 method for metric gradient flows. ESAIM: Math. Model. Numer. Anal. 53(1), 145–172 (2019)
Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., Teller, E.: Equation of state calculations by fast computing machines. J. Chem. Phys. 21(6), 1087–1092 (1953)
Mika, S., Rätsch, G., Weston, J., Schölkopf, B., Müllers, K.R.: Fisher discriminant analysis with kernels. In: Neural Networks for Signal Processing IX: Proceedings of the 1999 IEEE Signal Processing Society Workshop, pp. 41–48. IEEE (1999)
Murphy, K.P.: Machine Learning: A Probabilistic Perspective. MIT Press, Cambridge (2012)
Neal, R.M.: Probabilistic Inference Using Markov Chain Monte Carlo Methods. Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto (1993)
Neal, R.M., Hinton, G.E.: A view of the EM algorithm that justifies incremental, sparse, and other variants. In: Learning in Graphical Models, pp. 355–368. Springer (1998)
Onsager, L.: Reciprocal relations in irreversible processes. I. Phys. Rev. 37(4), 405 (1931a)
Onsager, L.: Reciprocal relations in irreversible processes. II. Phys. Rev. 38(12), 2265 (1931b)
Papamakarios, G., Nalisnick, E., Rezende, D.J., Mohamed, S., Lakshminarayanan, B.: Normalizing flows for probabilistic modeling and inference (2019). arXiv preprint arXiv:1912.02762
Parisi, G.: Correlation functions and computer simulations. Nucl. Phys. B 180(3), 378–384 (1981)
Rayleigh, L.: Note on the numerical calculation of the roots of fluctuating functions. Proc. Lond. Math. Soc. 1(1), 119–124 (1873)
Rezende, D.J., Mohamed, S.: Variational inference with normalizing flows (2015). arXiv preprint arXiv:1505.05770
Roberts, G.O., Tweedie, R.L.: Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli 2(4), 341–363 (1996)
Rockafellar, R.T.: Monotone operators and the proximal point algorithm. SIAM J. Control Optim. 14(5), 877–898 (1976)
Rossky, P.J., Doll, J.D., Friedman, H.L.: Brownian dynamics as smart Monte Carlo simulation. J. Chem. Phys. 69(10), 4628–4633 (1978)
Salimans, T., Kingma, D., Welling, M.: Markov chain Monte Carlo and variational inference: bridging the gap. In: International Conference on Machine Learning, pp. 1218–1226 (2015)
Salman, H., Yadollahpour, P., Fletcher, T., Batmanghelich, K.: Deep diffeomorphic normalizing flows (2018). arXiv preprint arXiv:1810.03256
Santambrogio, F.: {Euclidean, metric, and Wasserstein} gradient flows: an overview. Bull. Math. Sci. 7(1), 87–154 (2017)
Sonoda, S., Murata, N.: Transport analysis of infinitely deep neural network. J. Mach. Learn. Res. 20(1), 31–82 (2019)
Stuart, A.M.: Inverse problems: a Bayesian perspective. Acta Numer. 19, 451–559 (2010)
Tabak, E.G., Vanden-Eijnden, E.: Density estimation by dual ascent of the log-likelihood. Commun. Math. Sci. 8(1), 217–233 (2010)
Temam, R., Miranville, A.: Mathematical Modeling in Continuum Mechanics. Cambridge University Press, Cambridge (2005)
Villani, C.: Optimal Transport: Old and New, vol. 338. Springer, Berlin (2008)
Wainwright, M.J., Jordan, M.I.: Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn. 1(1–2), 1–305 (2008)
Wang, D., Tang, Z., Bajaj, C., Liu, Q.: Stein variational gradient descent with matrix-valued kernels. In: Advances in Neural Information Processing Systems, pp. 7834–7844 (2019)
Welling, M., Teh, Y.W.: Bayesian learning via stochastic gradient Langevin dynamics. In: Proceedings of the 28th International Conference on Machine Learning, pp. 681–688 (2011)
Acknowledgements
Y. Wang and C. Liu are partially supported by the National Science Foundation Grant DMS-1759536. L. Kang is partially supported by the National Science Foundation Grant DMS-1916467.
Appendices
Energetic variational approach
In this appendix, we give a brief introduction to the energetic variational approach. We refer interested readers to Liu (2009) and Giga et al. (2017) for a more comprehensive description.
As mentioned previously, the energetic variational approach provides a paradigm to determine the dynamics of a dissipative system from a prescribed energy-dissipation law, which shifts the main task in the modeling of a dynamic system to the construction of the energy-dissipation law. In physics, an energy-dissipation law, yielded by the first and second laws of thermodynamics (Giga et al. 2017), is often given by
\[ \frac{\mathrm {d}}{\mathrm {d}t} \left( {\mathcal {K}}[{\varvec{\phi }}] + {\mathcal {F}}[{\varvec{\phi }}] \right) = - 2 {\mathcal {D}}[{\varvec{\phi }}, \dot{\varvec{\phi }}], \qquad \mathrm {(A.1)} \]
where \({\varvec{\phi }}\) is the state variable, \({\mathcal {K}}\) is the kinetic energy, \({\mathcal {F}}\) is the Helmholtz free energy, and \(2 {\mathcal {D}} \ge 0\) is the rate of energy dissipation. If \({\mathcal {K}} = 0\), one can view (A.1) as a generalization of gradient flow (Hohenberg and Halperin 1977).
The Least Action Principle states that the equation of motion for a Hamiltonian system can be derived from the variation of the action functional \({\mathcal {A}} = \int _{0}^T ({\mathcal {K}} - {\mathcal {F}})\, \mathrm {d}t\) with respect to \({\varvec{\phi }}({\varvec{z}}, t)\) (the trajectory), i.e.,
\[ \delta {\mathcal {A}} = \int _0^T \!\! \int (f_{\mathrm {inertial}} - f_{\mathrm {conv}}) \cdot \delta {\varvec{\phi }} \, \mathrm {d}{\varvec{z}} \, \mathrm {d}t. \]
This procedure yields the conservative forces of the system, that is, \( (f_{\mathrm{inertial}} - f_{\mathrm{conv}}) = \frac{\delta {\mathcal {A}}}{\delta {\varvec{\phi }}}\). Meanwhile, according to the maximum dissipation principle (MDP), the dissipative force can be obtained by minimizing the dissipation functional with respect to the “rate” \(\dot{\varvec{\phi }}\), i.e., \(f_{\mathrm{diss}} = \frac{\delta {\mathcal {D}}}{\delta \dot{\varvec{\phi }}}\). According to Newton’s second law (\(F = ma\)), we have the force balance condition \(f_{\mathrm{inertial}} = f_{\mathrm{conv}} + f_{\mathrm{diss}}\) (where \(f_{\mathrm{inertial}}\) plays the role of \(ma\)), which defines the dynamics of the system:
\[ \frac{\delta {\mathcal {A}}}{\delta {\varvec{\phi }}} = \frac{\delta {\mathcal {D}}}{\delta \dot{\varvec{\phi }}}. \qquad \mathrm {(A.2)} \]
In the case that \({\mathcal {K}} = 0\), we have
\[ - \frac{\delta {\mathcal {F}}}{\delta {\varvec{\phi }}} = \frac{\delta {\mathcal {D}}}{\delta \dot{\varvec{\phi }}}. \qquad \mathrm {(A.3)} \]
Notice that the free energy is decreasing in time when \({\mathcal {K}} = 0\). As an analogy, if we take the VI objective functional as \({\mathcal {F}}\), (A.1) gives a continuous mechanism to decrease the free energy, and (A.2) or (A.3) gives the equation for \({\varvec{\phi }}({\varvec{z}}, t)\).
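To make the analogy concrete, the following sketch (our notation, with \(V = -\ln \rho ^{*}\) and \(\rho _{[\varvec{\phi }]}\) the push-forward density, as in the main text) takes the VI objective as the free energy:

```latex
% Sketch: KL divergence as Helmholtz free energy (case \mathcal{K} = 0).
\mathcal{F}[\rho_{[\varvec{\phi}]}]
  = \mathrm{KL}\!\left(\rho_{[\varvec{\phi}]} \,\middle|\, \rho^{*}\right)
  = \int \rho_{[\varvec{\phi}]} \ln \rho_{[\varvec{\phi}]} \,\mathrm{d}\varvec{x}
  + \int \rho_{[\varvec{\phi}]}\, V \,\mathrm{d}\varvec{x},
\qquad
\frac{\mathrm{d}}{\mathrm{d}t}\,\mathcal{F}[\rho_{[\varvec{\phi}]}]
  = -2\mathcal{D} \le 0 .
```

so the KL divergence decreases monotonically along the flow determined by (A.3).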
Derivation of the transport equation
The transport equation can be derived directly from the conservation of probability mass. Let
\[ {\mathsf{F}}({\varvec{z}}, t) = \nabla _{{\varvec{z}}} \varvec{\phi }({\varvec{z}}, t) \]
be the deformation tensor associated with the flow map \(\varvec{\phi }({\varvec{z}}, t)\), i.e., the Jacobian matrix of \(\varvec{\phi }\). Then, due to the conservation of probability mass, we have
\[ \rho (\varvec{\phi }({\varvec{z}}, t), t) \det {\mathsf{F}}({\varvec{z}}, t) = \rho _0({\varvec{z}}), \]
which, after differentiating both sides with respect to \(t\) and using \(\frac{\mathrm {d}}{\mathrm {d}t} \det {\mathsf{F}} = \det {\mathsf{F}} \, ({\mathsf{F}}^{-\top } : \dot{{\mathsf{F}}})\), implies that
\[ \partial _t \rho + \nabla \cdot (\rho {\varvec{u}}) = 0, \qquad {\varvec{u}}(\varvec{\phi }({\varvec{z}}, t), t) = \dot{\varvec{\phi }}({\varvec{z}}, t). \]
Here the operation “:”, \({\mathsf{A}} : {\mathsf{B}} = \sum _{i} \sum _{j} A_{ij} B_{ij}\), is the Frobenius inner product between two matrices \({\mathsf{A}}, {\mathsf{B}} \in {\mathbb {R}}^{n \times m}\).
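The Frobenius inner product enters when differentiating \(\det {\mathsf{F}}\) in time through Jacobi's formula, \(\frac{\mathrm {d}}{\mathrm {d}t} \det {\mathsf{F}} = \det {\mathsf{F}} \, ({\mathsf{F}}^{-\top } : \dot{{\mathsf{F}}})\). A small numerical sanity check of this identity (illustrative code, not part of the paper):

```python
# Check Jacobi's formula d/dt det F = det(F) * (F^{-T} : F_dot) numerically.
import numpy as np

rng = np.random.default_rng(0)
F = np.eye(3) + 0.1 * rng.standard_normal((3, 3))   # deformation tensor F(0)
F_dot = rng.standard_normal((3, 3))                  # its time derivative

def frobenius(A, B):
    """Frobenius inner product A : B = sum_ij A_ij B_ij."""
    return float(np.sum(A * B))

lhs = np.linalg.det(F) * frobenius(np.linalg.inv(F).T, F_dot)

# Central finite difference of t -> det(F + t * F_dot) at t = 0.
eps = 1e-6
rhs = (np.linalg.det(F + eps * F_dot) - np.linalg.det(F - eps * F_dot)) / (2 * eps)

assert abs(lhs - rhs) < 1e-6
```

The same identity underlies Lagrangian discretizations of generalized diffusions (Liu and Wang 2020a).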
Computation of Equation (2.10)
In this part, we give a detailed derivation of the variation of \(\mathrm {KL}(\rho _{[\varvec{\phi }]} | \rho ^{*})\) with respect to the flow map \(\varvec{\phi }({\varvec{z}}): {\mathcal {X}}^0 \rightarrow {\mathcal {X}}^t\). Consider a small perturbation of \(\varvec{\phi }\),
\[ \varvec{\phi }^{\epsilon }({\varvec{z}}) = \varvec{\phi }({\varvec{z}}) + \epsilon \varvec{\psi }({\varvec{z}}), \]
where \(\varvec{\psi }({\varvec{z}}) = \widetilde{\varvec{\psi }}(\varvec{\phi }({\varvec{z}}))\) is a smooth map satisfying
\[ \widetilde{\varvec{\psi }} \cdot \varvec{\nu } = 0 \quad \text {on } \partial {\mathcal {X}}^t, \]
with \(\varvec{\nu }\) being the outward-pointing unit normal on the boundary \(\partial {\mathcal {X}}^t\). Thus, \(\widetilde{\varvec{\psi }}=\varvec{\psi }(\varvec{\phi }^{-1}({\varvec{x}}))\), and the above condition indicates that \(\widetilde{\varvec{\psi }}\) vanishes at the boundary of \({\mathcal {X}}^t\). For \({\mathcal {X}}^0 = {\mathcal {X}}^t = {\mathbb {R}}^d\), \(\widetilde{\varvec{\psi }} \in C_0^{\infty } ({\mathbb {R}}^d)\). We denote by \({\mathsf{F}}\) the Jacobian matrix of \(\varvec{\phi }\), i.e., \(F_{ij}=\frac{\partial \phi _i}{\partial z_j}\), and by \({\mathsf{F}}^{\epsilon }\) the Jacobian matrix of \(\varvec{\phi }^{\epsilon }\), i.e.,
\[ {\mathsf{F}}^{\epsilon } = {\mathsf{F}} + \epsilon \nabla _{{\varvec{z}}} \varvec{\psi }. \]
Then we have
For two matrices of the same size, define \(\mathsf{A}:\mathsf{B}=\sum _i\sum _j A_{ij}B_{ij}=\mathrm {tr}(\mathsf{A}^\top \mathsf{B})\). Since
we have
Hence, we have the last result in (C.1).
Based on the definition of \(\varvec{\phi }\), we have the following.
The second summand of (C.1) becomes
Now, we investigate the first summand in (C.1). Based on the definition of \(\widetilde{\varvec{\psi }}\), we can see that
because \({\mathsf{F}}=\nabla _{{\varvec{z}}}\varvec{\phi }({\varvec{z}})=\left( \frac{\partial x_i}{\partial z_j}\right) _{i,j}\) and \({\mathsf{F}}^{-1}=\left( \frac{\partial z_i}{\partial x_j}\right) _{i,j}=\nabla _{{\varvec{x}}} \varvec{\phi }^{-1}({\varvec{x}})\). The divergence of \(\widetilde{\varvec{\psi }}({\varvec{x}})\) is
Therefore, the first summand in (C.1) becomes,
By the divergence theorem,
\[ \int _{{\mathcal {X}}^t} \nabla _{{\varvec{x}}} \cdot \left( \rho \, \widetilde{\varvec{\psi }} \right) \mathrm {d}{\varvec{x}} = \int _{\partial {\mathcal {X}}^t} \rho \, \widetilde{\varvec{\psi }} \cdot \varvec{\nu } \, \mathrm {d}S = 0, \]
because the boundary condition \(\widetilde{\varvec{\psi }}\cdot \varvec{\nu }=0\) holds on \(\partial {\mathcal {X}}^t\). So
Therefore, in \({\mathcal {X}}^t\), by performing integration by parts, we have
which implies that
Recall \(V = - \ln \rho ^{*}\). One can notice that if \({\mathsf{F}}\) is the identity matrix, the result in (C.1) can be written as
which is exactly the form given by the Stein operator in Liu and Wang (2016).
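For reference, the displayed computations above can be summarized as follows (our reconstruction from the definitions in this appendix, with \(\rho _0\) denoting the initial density; signs follow \(V = -\ln \rho ^{*}\)):

```latex
% Variation of the KL divergence with respect to the flow map (sketch).
\frac{\mathrm{d}}{\mathrm{d}\epsilon}
  \mathrm{KL}\!\left(\rho_{[\varvec{\phi}^{\epsilon}]} \,\middle|\, \rho^{*}\right)
  \Big|_{\epsilon = 0}
= \int \rho_0(\varvec{z})
    \left[ \nabla V(\varvec{\phi}(\varvec{z})) \cdot \varvec{\psi}(\varvec{z})
         - \mathsf{F}^{-\top} : \nabla_{\varvec{z}} \varvec{\psi}(\varvec{z})
    \right] \mathrm{d}\varvec{z}
= \int \rho_{[\varvec{\phi}]}(\varvec{x})
    \left[ \nabla V(\varvec{x}) \cdot \widetilde{\varvec{\psi}}(\varvec{x})
         - \nabla_{\varvec{x}} \cdot \widetilde{\varvec{\psi}}(\varvec{x})
    \right] \mathrm{d}\varvec{x} .
```

The second equality uses the change of variables \({\varvec{x}} = \varvec{\phi }({\varvec{z}})\) together with \({\mathsf{F}}^{-\top } : \nabla _{{\varvec{z}}} \varvec{\psi } = \nabla _{{\varvec{x}}} \cdot \widetilde{\varvec{\psi }}\); the negation of the last integrand is the Stein operator of Liu and Wang (2016) applied to \(\widetilde{\varvec{\psi }}\).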
Proof of Theorem 1
Proof
Let \({\mathbf {X}}\in {\mathbb {R}}^D\) be vectorized \(\{ {\varvec{x}}_i \}_{i = 1}^N\), that is
where \(D = N \times d\). Recall that \(V({\varvec{x}}) = - \ln \rho ^{*}({\varvec{x}})\). For a sufficiently smooth target distribution \(\rho ^{*}({\varvec{x}})\), it is easy to show that
is continuous, coercive and bounded from below as a function of \({\mathbf {X}}\in {\mathbb {R}}^D\). We denote \({\mathcal {F}}_h ( \{ {\varvec{x}}_i \}_{i=1}^{N})\) by \({\mathcal {F}}_h ({\mathbf {X}})\).
For any given \(\{ {\varvec{x}}_i^n \}_{i=1}^N\), recall
where \(\Vert \cdot \Vert _{{\mathbf {X}}}\) is a norm on \({\mathbb {R}}^D\), defined by
Since
is a non-empty, bounded, and closed set, by the coerciveness and continuity of \({\mathcal {F}}_h({\mathbf {X}})\), \(J_n({\mathbf {X}})\) admits a global minimizer \({\mathbf {X}}^{n+1}\) in \({\mathcal {S}}\). Since \({\mathbf {X}}^{n+1}\) is a global minimizer of \(J_n({\mathbf {X}})\), we have
which gives us equation (3.19).
For the sequence \(\{ {\mathbf {X}}^n \}\), since
we have
for some constant \(C\) that is independent of \(n\). Hence
which indicates the convergence of \(\{ {\mathbf {X}}^n \}\). Moreover, since
we have
so \(\{{\mathbf {X}}^n\}\) converges to a stationary point of \({\mathcal {F}}_h({\mathbf {X}})\). \(\square \)
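The proximal (implicit Euler) step in the proof can be illustrated numerically. The sketch below is ours, not the paper's code: `F_h` is a toy smooth, coercive energy standing in for \({\mathcal {F}}_h\), and each step approximately minimizes \(J_n({\mathbf {X}}) = F_h({\mathbf {X}}) + \Vert {\mathbf {X}} - {\mathbf {X}}^n \Vert ^2 / (2\tau )\) by gradient descent:

```python
# Toy proximal iteration X^{n+1} = argmin_Y F_h(Y) + ||Y - X^n||^2 / (2 tau).
import numpy as np

def F_h(X):
    """Toy stand-in free energy: smooth, coercive, bounded from below."""
    return float(np.sum(X**2) + np.sum(np.cos(X)))

def grad_F_h(X):
    return 2 * X - np.sin(X)

def prox_step(Xn, tau, lr=0.05, iters=200):
    """Approximately minimize J_n(Y) = F_h(Y) + ||Y - Xn||^2 / (2 tau)."""
    Y = Xn.copy()
    for _ in range(iters):
        Y -= lr * (grad_F_h(Y) + (Y - Xn) / tau)
    return Y

tau = 0.5
X = np.array([2.0, -1.5, 0.7])           # X^0
energies = [F_h(X)]
for _ in range(50):
    X = prox_step(X, tau)                # X^{n+1}
    energies.append(F_h(X))

# Minimizing J_n forces F_h(X^{n+1}) + ||X^{n+1} - X^n||^2 / (2 tau) <= F_h(X^n),
# so the discrete energy decays and the iterates approach a stationary point.
assert all(e1 <= e0 + 1e-12 for e0, e1 in zip(energies, energies[1:]))
assert np.linalg.norm(grad_F_h(X)) < 1e-3
```

The energy-decay assertion mirrors the role of (3.19) in the proof: the descent property comes from the minimization itself, not from any step-size restriction.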
Cite this article
Wang, Y., Chen, J., Liu, C. et al. Particle-based energetic variational inference. Stat Comput 31, 34 (2021). https://doi.org/10.1007/s11222-021-10009-7