Particle-based energetic variational inference

Abstract

We introduce a new variational inference (VI) framework, called energetic variational inference (EVI), which minimizes the VI objective function based on a prescribed energy-dissipation law. Using the EVI framework, we can derive many existing particle-based variational inference (ParVI) methods, including the popular Stein variational gradient descent (SVGD). More importantly, many new ParVI schemes can be created under this framework. For illustration, we propose a new particle-based EVI scheme that performs the particle-based approximation of the density first and then uses the approximated density in the variational procedure, or “Approximation-then-Variation” for short. Thanks to this order of approximation and variation, the new scheme maintains the variational structure at the particle level and can significantly decrease the KL-divergence in each iteration. Numerical experiments show that the proposed method outperforms some existing ParVI methods in terms of fidelity to the target distribution.

Notes

  1. Available from https://github.com/dilinwang820/Stein-Variational-Gradient-Descent.

References

  • Ali, S.M., Silvey, S.D.: A general class of coefficients of divergence of one distribution from another. J. R. Stat. Soc. Ser. B 28(1), 131–142 (1966)

  • Ambrosio, L., Lisini, S., Savaré, G.: Stability of flows associated to gradient vector fields and convergence of iterated transport maps. Manuscr. Math. 121(1), 1–50 (2006)

  • Arbel, M., Korba, A., Salim, A., Gretton, A.: Maximum mean discrepancy gradient flow. In: Advances in Neural Information Processing Systems, pp. 6484–6494 (2019)

  • Barzilai, J., Borwein, J.M.: Two-point step size gradient methods. IMA J. Numer. Anal. 8(1), 141–148 (1988)

  • Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, New York (2006)

  • Blei, D.M., Kucukelbir, A., McAuliffe, J.D.: Variational inference: a review for statisticians. J. Am. Stat. Assoc. 112(518), 859–877 (2017)

  • Carrillo, J.A., Lisini, S.: On the asymptotic behavior of the gradient flow of a polyconvex functional. Nonlinear Partial Differ. Equ. Hyperbolic Wave Phenom. 526, 37–51 (2010)

  • Carrillo, J.A., Düring, B., Matthes, D., McCormick, D.S.: A Lagrangian scheme for the solution of nonlinear diffusion equations using moving simplex meshes. J. Sci. Comput. 75(3), 1463–1499 (2018)

  • Carrillo, J.A., Craig, K., Patacchini, F.S.: A blob method for diffusion. Calc. Var. Partial. Differ. Equ. 58(2), 53 (2019)

  • Casella, G., George, E.I.: Explaining the Gibbs sampler. Am. Stat. 46(3), 167–174 (1992)

  • Chen, C., Zhang, R., Wang, W., Li, B., Chen, L.: A unified particle-optimization framework for scalable Bayesian sampling (2018). arXiv preprint arXiv:1805.11659

  • Chen, P., Wu, K., Chen, J., O’Leary-Roseberry, T., Ghattas, O.: Projected stein variational Newton: a fast and scalable Bayesian inference method in high dimensions (2019). arXiv preprint arXiv:1901.08659

  • Dai, B., He, N., Dai, H., Song, L.: Provable Bayesian inference via particle mirror descent. In: Artificial Intelligence and Statistics, pp. 985–994 (2016)

  • Degond, P., Mustieles, F.J.: A deterministic approximation of diffusion equations using particles. SIAM J. Sci. Comput. 11(2), 293–310 (1990)

  • Detommaso, G., Cui, T., Marzouk, Y., Spantini, A., Scheichl, R.: A Stein variational Newton method. In: Advances in Neural Information Processing Systems, pp. 9169–9179 (2018)

  • Du, Q., Feng, X.: The phase field method for geometric moving interfaces and their numerical approximations (2019). arXiv preprint arXiv:1902.04924

  • Duane, S., Kennedy, A.D., Pendleton, B.J., Roweth, D.: Hybrid Monte Carlo. Phys. Lett. B 195(2), 216–222 (1987)

  • Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011)

  • El Moselhy, T.A., Marzouk, Y.M.: Bayesian inference with optimal maps. J. Comput. Phys. 231(23), 7815–7850 (2012)

  • Evans, L.C., Savin, O., Gangbo, W.: Diffeomorphisms and nonlinear heat flows. SIAM J. Math. Anal. 37(3), 737–751 (2005)

  • Francois, D., Wertz, V., Verleysen, M., et al.: About the locality of kernels in high-dimensional spaces. In: International Symposium on Applied Stochastic Models and Data Analysis, pp. 238–245. Citeseer (2005)

  • Frogner, C., Poggio, T.: Approximate inference with Wasserstein gradient flows (2018). arXiv preprint arXiv:1806.04542

  • Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A., Rubin, D.B.: Bayesian Data Analysis. Chapman and Hall/CRC, Boca Raton (2013)

  • Geman, S., Geman, D.: Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 6, 721–741 (1984)

  • Gershman, S.J., Hoffman, M.D., Blei, D.M.: Nonparametric variational inference. In: Proceedings of the 29th International Conference on Machine Learning, pp. 235–242 (2012)

  • Giga, M.H., Kirshtein, A., Liu, C.: Variational modeling and complex fluids. Handbook of Mathematical Analysis in Mechanics of Viscous Fluids, pp. 1–41 (2017)

  • Gonzalez, O., Stuart, A.M.: A First Course in Continuum Mechanics. Cambridge University Press, Cambridge (2008)

  • Haario, H., Saksman, E., Tamminen, J.: Adaptive proposal distribution for random walk Metropolis algorithm. Comput. Stat. 14(3), 375–396 (1999)

  • Hastings, W.K.: Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57(1), 97–109 (1970)

  • Hohenberg, P.C., Halperin, B.I.: Theory of dynamic critical phenomena. Rev. Mod. Phys. 49(3), 435 (1977)

  • Iserles, A.: A First Course in the Numerical Analysis of Differential Equations, No. 44 in Cambridge Texts in Applied Mathematics. Cambridge University Press, New York (2009)

  • Jordan, R., Kinderlehrer, D., Otto, F.: The variational formulation of the Fokker–Planck equation. SIAM J. Math. Anal. 29(1), 1–17 (1998)

  • Jordan, M.I., Ghahramani, Z., Jaakkola, T.S., Saul, L.K.: An introduction to variational methods for graphical models. Mach. Learn. 37(2), 183–233 (1999)

  • Kingma, D.P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., Welling, M.: Improved variational inference with inverse autoregressive flow. In: Advances in Neural Information Processing Systems, pp. 4743–4751 (2016)

  • Lacombe, G., Mas-Gallic, S.: Presentation and analysis of a diffusion-velocity method. In: ESAIM: Proceedings, vol. 7, pp. 225–233. EDP Sciences (1999)

  • Li, L., Liu, J.G., Liu, Z., Lu, J.: A stochastic version of Stein variational gradient descent for efficient sampling (2019). arXiv preprint arXiv:1902.03394

  • Liu, C.: An introduction of elastic complex fluids: an energetic variational approach. In: Multi-Scale Phenomena in Complex Fluids: Modeling, Analysis and Numerical Simulation, pp. 286–337. World Scientific (2009)

  • Liu, Q.: Stein variational gradient descent as gradient flow. In: Advances in Neural Information Processing Systems, pp. 3115–3123 (2017)

  • Liu, Q., Wang, D.: Stein variational gradient descent: a general purpose Bayesian inference algorithm. In: Advances in Neural Information Processing Systems, pp. 2378–2386 (2016)

  • Liu, C., Wang, Y.: On Lagrangian schemes for porous medium type generalized diffusion equations: a discrete energetic variational approach. J. Comput. Phys. 417, 109566 (2020a)

  • Liu, C., Wang, Y.: A variational Lagrangian scheme for a phase field model: a discrete energetic variational approach. SIAM J. Sci. Comput. 42(6), B1541–B1569 (2020b)

  • Liu, C., Zhu, J.: Riemannian Stein variational gradient descent for Bayesian inference. In: 32nd AAAI Conference on Artificial Intelligence (2018)

  • Liu, C., Zhuo, J., Cheng, P., Zhang, R., Zhu, J.: Understanding and accelerating particle-based variational inference. In: International Conference on Machine Learning, pp. 4082–4092 (2019)

  • Lu, J., Lu, Y., Nolen, J.: Scaling limit of the Stein variational gradient descent: the mean field regime. SIAM J. Math. Anal. 51(2), 648–671 (2019)

  • MacKay, D.J.C.: Information Theory, Inference and Learning Algorithms. Cambridge University Press, Cambridge (2003)

  • Matthes, D., Plazotta, S.: A variational formulation of the BDF2 method for metric gradient flows. ESAIM: Math. Model. Numer. Anal. 53(1), 145–172 (2019)

  • Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., Teller, E.: Equation of state calculations by fast computing machines. J. Chem. Phys. 21(6), 1087–1092 (1953)

  • Mika, S., Rätsch, G., Weston, J., Schölkopf, B., Müllers, K.R.: Fisher discriminant analysis with kernels. In: Neural Networks for Signal Processing IX: Proceedings of the 1999 IEEE Signal Processing Society Workshop, pp. 41–48. IEEE (1999)

  • Murphy, K.P.: Machine Learning: A Probabilistic Perspective. MIT Press, Cambridge (2012)

  • Neal, R.M.: Probabilistic inference using Markov chain Monte Carlo methods. Department of Computer Science, University of Toronto, Toronto, Ontario, Canada (1993)

  • Neal, R.M., Hinton, G.E.: A view of the EM algorithm that justifies incremental, sparse, and other variants. In: Learning in Graphical Models, pp. 355–368. Springer (1998)

  • Onsager, L.: Reciprocal relations in irreversible processes. I. Phys. Rev. 37(4), 405 (1931a)

  • Onsager, L.: Reciprocal relations in irreversible processes. II. Phys. Rev. 38(12), 2265 (1931b)

  • Papamakarios, G., Nalisnick, E., Rezende, D.J., Mohamed, S., Lakshminarayanan, B.: Normalizing flows for probabilistic modeling and inference (2019). arXiv preprint arXiv:1912.02762

  • Parisi, G.: Correlation functions and computer simulations. Nucl. Phys. B 180(3), 378–384 (1981)

  • Rayleigh, L.: Note on the numerical calculation of the roots of fluctuating functions. Proc. Lond. Math. Soc. 1(1), 119–124 (1873)

  • Rezende, D.J., Mohamed, S.: Variational inference with normalizing flows (2015). arXiv preprint arXiv:1505.05770

  • Roberts, G.O., Tweedie, R.L.: Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli 2(4), 341–363 (1996)

  • Rockafellar, R.T.: Monotone operators and the proximal point algorithm. SIAM J. Control Optim. 14(5), 877–898 (1976)

  • Rossky, P.J., Doll, J.D., Friedman, H.L.: Brownian dynamics as smart Monte Carlo simulation. J. Chem. Phys. 69(10), 4628–4633 (1978)

  • Salimans, T., Kingma, D., Welling, M.: Markov chain Monte Carlo and variational inference: bridging the gap. In: International Conference on Machine Learning, pp. 1218–1226 (2015)

  • Salman, H., Yadollahpour, P., Fletcher, T., Batmanghelich, K.: Deep diffeomorphic normalizing flows (2018). arXiv preprint arXiv:1810.03256

  • Santambrogio, F.: {Euclidean, metric, and Wasserstein} gradient flows: an overview. Bull. Math. Sci. 7(1), 87–154 (2017)

  • Sonoda, S., Murata, N.: Transport analysis of infinitely deep neural network. J. Mach. Learn. Res. 20(1), 31–82 (2019)

  • Stuart, A.M.: Inverse problems: a Bayesian perspective. Acta Numer. 19, 451–559 (2010)

  • Tabak, E.G., Vanden-Eijnden, E.: Density estimation by dual ascent of the log-likelihood. Commun. Math. Sci. 8(1), 217–233 (2010)

  • Temam, R., Miranville, A.: Mathematical Modeling in Continuum Mechanics. Cambridge University Press, Cambridge (2005)

  • Villani, C.: Optimal Transport: Old and New, vol. 338. Springer, Berlin (2008)

  • Wainwright, M.J., Jordan, M.I.: Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn. 1(1–2), 1–305 (2008)

  • Wang, D., Tang, Z., Bajaj, C., Liu, Q.: Stein variational gradient descent with matrix-valued kernels. In: Advances in Neural Information Processing Systems, pp. 7834–7844 (2019)

  • Welling, M., Teh, Y.W.: Bayesian learning via stochastic gradient Langevin dynamics. In: Proceedings of the 28th International Conference on Machine Learning, pp. 681–688 (2011)

Acknowledgements

Y. Wang and C. Liu are partially supported by the National Science Foundation Grant DMS-1759536. L. Kang is partially supported by the National Science Foundation Grant DMS-1916467.

Author information

Correspondence to Lulu Kang.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Energetic variational approach

In this appendix, we give a brief introduction to the energetic variational approach. We refer interested readers to Liu (2009) and Giga et al. (2017) for a more comprehensive description.

As mentioned previously, the energetic variational approach provides a paradigm for determining the dynamics of a dissipative system from a prescribed energy-dissipation law, which shifts the main task in modeling a dynamic system to the construction of that law. In physics, an energy-dissipation law, derived from the first and second laws of thermodynamics (Giga et al. 2017), is often given by

$$\begin{aligned} \frac{\mathrm {d}}{\mathrm {d}t} ({\mathcal {K}} + {\mathcal {F}}) [{\varvec{\phi }}] = - 2 {\mathcal {D}}[{\varvec{\phi }}, \dot{\varvec{\phi }}], \end{aligned}$$
(A.1)

where \({\varvec{\phi }}\) is the state variable, \({\mathcal {K}}\) is the kinetic energy, \({\mathcal {F}}\) is the Helmholtz free energy, and \(2 {\mathcal {D}}\) is the rate of energy dissipation. If \({\mathcal {K}} = 0\), one can view (A.1) as a generalization of gradient flow (Hohenberg and Halperin 1977).

The Least Action Principle states that the equation of motion for a Hamiltonian system can be derived from the variation of the action functional \({\mathcal {A}} = \int _{0}^T ({\mathcal {K}} - {\mathcal {F}})\,\mathrm {d}t\) with respect to the trajectory \({\varvec{\phi }}({\varvec{z}}, t)\), i.e.,

$$\begin{aligned} \begin{aligned} \delta {\mathcal {A}}&= \lim _{\epsilon \rightarrow 0} \frac{{\mathcal {A}}[{\varvec{\phi }} + \epsilon \delta {\varvec{\psi }}] - {\mathcal {A}}[{\varvec{\phi }}] }{\epsilon } \\&= \int _{0}^T \int _{{\mathcal {X}}^t} (f_{\mathrm{inertial}} - f_{\mathrm{conv}}) \cdot \delta {\varvec{\psi }} \mathrm {d}{\varvec{x}}\mathrm {d}t. \\ \end{aligned} \end{aligned}$$

This procedure yields the conservative forces of the system, that is, \( f_{\mathrm{inertial}} - f_{\mathrm{conv}} = \frac{\delta {\mathcal {A}}}{\delta {\varvec{\phi }}}\). Meanwhile, according to the Maximum Dissipation Principle (MDP), the dissipative force can be obtained by minimizing the dissipation functional with respect to the “rate” \(\dot{\varvec{\phi }}\), i.e.,

$$\begin{aligned} \delta {\mathcal {D}} = \lim _{\epsilon \rightarrow 0} \frac{{\mathcal {D}}[\dot{\varvec{\phi }} + \epsilon \delta {\varvec{\psi }}] - {\mathcal {D}}[\dot{\varvec{\phi }}] }{\epsilon } = \int _{{\mathcal {X}}^t} f_{\mathrm{diss}} \cdot \delta {\varvec{\psi }} \,\mathrm {d}{\varvec{x}}, \end{aligned}$$

or \(f_{\mathrm{diss}} = \frac{\delta {\mathcal {D}}}{\delta \dot{\varvec{\phi }}}\). According to Newton’s second law (\(F = ma\)), we have the force balance condition \(f_{\mathrm{inertial}} = f_{\mathrm{conv}} + f_{\mathrm{diss}}\) (\(f_{\mathrm{inertial}}\) plays the role of \(ma\)), which defines the dynamics of the system:

$$\begin{aligned} \frac{\delta {\mathcal {D}}}{\delta \dot{\varvec{\phi }}} = \frac{\delta {\mathcal {A}}}{\delta {\varvec{\phi }}}. \end{aligned}$$
(A.2)

In the case that \({\mathcal {K}} = 0\), we have

$$\begin{aligned} \frac{\delta {\mathcal {D}}}{\delta \dot{\varvec{\phi }}} = - \frac{\delta {\mathcal {F}}}{\delta {\varvec{\phi }}}. \end{aligned}$$
(A.3)

Notice that the free energy decreases in time when \({\mathcal {K}} = 0\). As an analogy, if we view the VI objective functional as \({\mathcal {F}}\), then (A.1) gives a continuous-time mechanism for decreasing the free energy, and (A.2) or (A.3) gives the equation that determines the flow map \({\varvec{\phi }}({\varvec{z}}, t)\).
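
To make this analogy concrete, the following is a minimal sketch (ours, not from the paper) of how (A.3) produces a dynamics once a free energy is prescribed: with the quadratic dissipation \({\mathcal {D}} = \frac{1}{2}\Vert \dot{\varvec{\phi }}\Vert ^2\), (A.3) reduces to the gradient flow \(\dot{\varvec{\phi }} = -\nabla {\mathcal {F}}({\varvec{\phi }})\), discretized here by explicit Euler; the quadratic free energy is an illustrative assumption.

```python
import numpy as np

# Sketch (ours): dynamics from the energy-dissipation law (A.3).
# With dissipation D = (1/2)|phi_dot|^2, the force balance
# delta D / delta phi_dot = - delta F / delta phi reduces to the
# gradient flow  phi_dot = -grad F(phi).
# The free energy below (a quadratic well) is an illustrative assumption.

def grad_F(phi):
    return phi - 1.0              # F(phi) = 0.5 * (phi - 1)^2

phi, dt = 4.0, 0.1
for _ in range(100):              # explicit Euler discretization
    phi -= dt * grad_F(phi)

print(phi)                        # approaches the minimizer phi = 1;
                                  # F decreases monotonically along the flow
```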

Derivation of the transport equation

The transport equation can be derived from the conservation of probability mass directly. Let

$$\begin{aligned} {\mathsf{F}}({\varvec{z}}, t) = \nabla _{{\varvec{z}}} \varvec{\phi }({\varvec{z}}, t) \end{aligned}$$
(B.1)

be the deformation tensor associated with the flow map \(\varvec{\phi }({\varvec{z}}, t)\), i.e., the Jacobian matrix of \({\varvec{\phi }}\). Then, due to the conservation of probability mass, we have

$$\begin{aligned} \begin{aligned} 0&= \frac{\mathrm {d}}{\mathrm {d}t} \int _{{\mathcal {X}}^t} \rho ({\varvec{x}}, t) \,\mathrm {d}{\varvec{x}}= \frac{\mathrm {d}}{\mathrm {d}t} \int _{{\mathcal {X}}^0} \rho (\varvec{\phi }({\varvec{z}}, t), t) \det ({\mathsf{F}}({\varvec{z}}, t)) \,\mathrm {d}{\varvec{z}}\\&= \int _{{\mathcal {X}}^0} \left( {\dot{\rho }} + \nabla \rho \cdot {\mathbf {u}}+ \rho \left( {\mathsf{F}}^{-\mathrm T} : \frac{\mathrm {d}{\mathsf{F}}}{\mathrm {d}t}\right) \right) \det {\mathsf{F}}\,\mathrm {d}{\varvec{z}}\\&= \int _{{\mathcal {X}}^t} \left( {\dot{\rho }} + \nabla \rho \cdot {\mathbf {u}}+ \rho (\nabla \cdot {\mathbf {u}}) \right) \mathrm {d}{\varvec{x}}, \end{aligned} \end{aligned}$$

which implies that

$$\begin{aligned} {\dot{\rho }} + \nabla \cdot (\rho {\mathbf {u}}) = 0. \end{aligned}$$

Here the operation “:” denotes the Frobenius inner product between two matrices \(\mathsf{A}, \mathsf{B} \in {\mathbb {R}}^{n \times m}\), i.e., \(\mathsf{A} : \mathsf{B} = \sum _{i} \sum _{j} A_{ij} B_{ij}\).
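
As a sanity check (ours, not part of the derivation), the change-of-variables identity behind this computation, \(\rho (\varvec{\phi }({\varvec{z}}))\det {\mathsf{F}}({\varvec{z}}) = \rho _0({\varvec{z}})\), can be verified numerically in one dimension; the flow map below is an illustrative assumption.

```python
import numpy as np

# Sanity check (ours) of the identity rho(phi(z)) * det F(z) = rho_0(z),
# which underlies the mass-conservation computation above. 1-d example
# with an illustrative flow map phi(z) = z + 0.5*tanh(z).

def trapz(y, x):                      # trapezoidal quadrature
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

z = np.linspace(-8.0, 8.0, 4001)
rho0 = np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)   # initial density rho_0

phi = z + 0.5 * np.tanh(z)            # flow map (assumption)
F = 1.0 + 0.5 / np.cosh(z)**2         # deformation tensor dphi/dz > 0
rho = rho0 / F                        # pushforward density at x = phi(z)

print(trapz(rho0, z), trapz(rho, phi))  # both ~ 1.0: mass is conserved
```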

Computation of Equation (2.10)

In this part, we give a detailed derivation of the variation of \(\mathrm {KL}(\rho _{[\varvec{\phi }]} \,||\, \rho ^{*})\) with respect to the flow map \(\varvec{\phi }({\varvec{z}}): {\mathcal {X}}^0 \rightarrow {\mathcal {X}}^t\). Consider a small perturbation of \(\varvec{\phi }\),

$$\begin{aligned} \varvec{\phi }^{\epsilon }({\varvec{z}}):= \varvec{\phi }({\varvec{z}}) + \epsilon \varvec{\psi }({\varvec{z}}), \end{aligned}$$

where \(\varvec{\psi }({\varvec{z}}) = \widetilde{\varvec{\psi }}(\varvec{\phi }({\varvec{z}}))\) is a smooth map satisfying

$$\begin{aligned} \widetilde{\varvec{\psi }} \cdot \varvec{\nu } = 0, \quad \text {on} ~~ \partial {\mathcal {X}}^t \end{aligned}$$

with \(\varvec{\nu }\) being the outward-pointing unit normal on the boundary \(\partial {\mathcal {X}}^t\). Thus, \(\widetilde{\varvec{\psi }}({\varvec{x}})=\varvec{\psi }(\varvec{\phi }^{-1}({\varvec{x}}))\), and the above condition indicates that \(\widetilde{\varvec{\psi }}\) has no component normal to the boundary of \({\mathcal {X}}^t\). For \({\mathcal {X}}^0 = {\mathcal {X}}^t = {\mathbb {R}}^d\), we take \(\widetilde{\varvec{\psi }} \in C_0^{\infty } ({\mathbb {R}}^d)\). We denote by \({\mathsf{F}}\) the Jacobian matrix of \(\varvec{\phi }\), i.e., \(F_{ij}=\frac{\partial \phi _i}{\partial z_j}\), and by \({\mathsf{F}}^{\epsilon }\) the Jacobian matrix of \(\varvec{\phi }^{\epsilon }\), i.e.,

$$\begin{aligned} {\mathsf{F}}^{\epsilon }:= \nabla _{{\varvec{z}}} \varvec{\phi } + \epsilon \nabla _{{\varvec{z}}} \varvec{\psi }. \end{aligned}$$

Then we have

$$\begin{aligned} \begin{aligned}&\frac{\mathrm {d}}{\mathrm {d}\epsilon } \Big |_{\epsilon = 0} \mathrm {KL}(\rho _{[\varvec{\phi }^{\epsilon }]} \,||\, \rho ^{*}) \\&\quad = \frac{\mathrm {d}}{\mathrm {d}\epsilon } \Big |_{\epsilon =0 }\left( \int _{{\mathcal {X}}^0} \frac{\rho _0}{\det ({\mathsf{F}}^{\epsilon })} \ln \left( \frac{\rho _0}{\det ({\mathsf{F}}^{\epsilon })} \right) \det ({\mathsf{F}}^{\epsilon })\,\mathrm {d}{\varvec{z}}+ \int _{{\mathcal {X}}^0} \frac{\rho _0}{\det ({\mathsf{F}}^{\epsilon })} V( \varvec{\phi }^{\epsilon }({\varvec{z}}))\, \det ({\mathsf{F}}^{\epsilon })\, \mathrm {d}{\varvec{z}}\right) \\&\quad = \frac{\mathrm {d}}{\mathrm {d}\epsilon } \Big |_{\epsilon =0 } \int _{{\mathcal {X}}^0} \left( \rho _0\ln \rho _0-\rho _0\ln \det {\mathsf{F}}^{\epsilon } + \rho _0 V( \varvec{\phi }^{\epsilon }({\varvec{z}})) \right) \mathrm {d}{\varvec{z}}\\&\quad =\int _{{\mathcal {X}}^0} \rho _0\,\frac{\mathrm {d}}{\mathrm {d}\epsilon } \Big |_{\epsilon =0 }\left( -\ln \det ({\mathsf{F}}^{\epsilon }) + V( \varvec{\phi }^{\epsilon }({\varvec{z}}))\right) \mathrm {d}{\varvec{z}}\\&\quad = \int _{{\mathcal {X}}^0} - \rho _0 ({\mathsf{F}}^{-\top }:\nabla _{{\varvec{z}}} \varvec{\psi }) + \rho _0 (\nabla _{{\varvec{x}}} V \cdot \varvec{\psi }) \,\mathrm {d}{\varvec{z}}. \end{aligned} \end{aligned}$$
(C.1)

For two matrices of the same size, define \(\mathsf{A}:\mathsf{B}=\sum _i\sum _j A_{ij}B_{ij}=\mathrm {tr}(\mathsf{A}^\top \mathsf{B})\). Since

$$\begin{aligned} \frac{\mathrm {d}\det ({\mathsf{F}}^{\epsilon })}{\mathrm {d}\epsilon }=\det ({\mathsf{F}}^{\epsilon })\mathrm {tr}\left[ ({\mathsf{F}}^{\epsilon })^{-1}\frac{\mathrm {d}{\mathsf{F}}^{\epsilon }}{\mathrm {d}\epsilon }\right] , \end{aligned}$$

we have

$$\begin{aligned} \frac{\mathrm {d}\det ({\mathsf{F}}^{\epsilon })}{\mathrm {d}\epsilon }\Big |_{\epsilon =0 }&=\det ({\mathsf{F}})\mathrm {tr}\left[ {\mathsf{F}}^{-1}\nabla _{{\varvec{z}}}\varvec{\psi }\right] \\&=\det ({\mathsf{F}})({\mathsf{F}}^{-\top }:\nabla _{{\varvec{z}}}\varvec{\psi }). \end{aligned}$$

Hence, we have the last result in (C.1).
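
The Jacobi formula above is easy to verify numerically; the following finite-difference check (ours) is included only as a sanity test, with a random \(3\times 3\) pair \(({\mathsf{F}}, \nabla _{{\varvec{z}}}\varvec{\psi })\) as an illustrative assumption.

```python
import numpy as np

# Finite-difference check (ours) of the Jacobi formula used above:
#   d/d(eps) det(F + eps*G)|_{eps=0} = det(F) tr(F^{-1} G)
#                                    = det(F) (F^{-T} : G).
rng = np.random.default_rng(0)
F = np.eye(3) + 0.1 * rng.standard_normal((3, 3))   # Jacobian of phi
G = rng.standard_normal((3, 3))                     # Jacobian of psi

eps = 1e-6
fd = (np.linalg.det(F + eps * G) - np.linalg.det(F - eps * G)) / (2 * eps)
exact = np.linalg.det(F) * np.trace(np.linalg.solve(F, G))
print(fd, exact)   # agree up to O(eps^2) discretization error
```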

Based on the definition of \(\varvec{\phi }\), we have the following.

$$\begin{aligned}&{\varvec{x}}=\varvec{\phi }({\varvec{z}}), \quad {\varvec{z}}=\varvec{\phi }^{-1}({\varvec{x}})\\&\varvec{\psi }({\varvec{z}})=\widetilde{\varvec{\psi }}(\varvec{\phi }({\varvec{z}}))=\widetilde{\varvec{\psi }}({\varvec{x}})\\&\rho ({\varvec{x}})=\rho _0(\varvec{\phi }^{-1}({\varvec{x}}))\det (\nabla _{{\varvec{x}}}\varvec{\phi }^{-1}({\varvec{x}}))\\&\rho _0({\varvec{z}})=\rho _0(\varvec{\phi }^{-1}({\varvec{x}}))=\frac{\rho ({\varvec{x}})}{\det (\nabla _{{\varvec{x}}}\varvec{\phi }^{-1}({\varvec{x}}))}. \end{aligned}$$

The second summand of (C.1) becomes

$$\begin{aligned}&\int _{{\mathcal {X}}_0}\rho _0\left[ (\nabla _{{\varvec{x}}}V)^\top \varvec{\psi }\right] \mathrm {d}{\varvec{z}}\\&\quad =\int _{{\mathcal {X}}_t}\frac{\rho ({\varvec{x}})}{\det (\nabla _{{\varvec{x}}}\varvec{\phi }^{-1}({\varvec{x}}))} \left[ (\nabla _{{\varvec{x}}}V)^\top \varvec{\psi }\right] \mathrm {d}\varvec{\phi }^{-1}({\varvec{x}})\\&\quad =\int _{{\mathcal {X}}_t}\frac{\rho ({\varvec{x}})}{\det (\nabla _{{\varvec{x}}}\varvec{\phi }^{-1}({\varvec{x}}))} \left[ (\nabla _{{\varvec{x}}}V)^\top \varvec{\psi }\right] \det (\nabla _{{\varvec{x}}}\varvec{\phi }^{-1}({\varvec{x}})) \mathrm {d}{\varvec{x}}\\&\quad =\int _{{\mathcal {X}}_t}\rho ({\varvec{x}})\left[ (\nabla _{{\varvec{x}}}V)\cdot \varvec{\psi }\right] \mathrm {d}{\varvec{x}}. \end{aligned}$$

Now, we investigate the first summand in (C.1). Based on the definition of \(\widetilde{\varvec{\psi }}\), we can see that

$$\begin{aligned} (\nabla _{{\varvec{x}}} \widetilde{\varvec{\psi }})_{i,j}&=\sum _{k=1}^d \frac{\partial \psi _i}{\partial z_k}\cdot \frac{\partial z_k}{\partial x_j}=\sum _{k=1}^d (\nabla _{{\varvec{z}}}\varvec{\psi })_{i,k}(\nabla _{{\varvec{x}}}\varvec{\phi }^{-1})_{k,j}\\ \nabla _{{\varvec{x}}} \widetilde{\varvec{\psi }}&=(\nabla _{{\varvec{z}}}\varvec{\psi })\, \nabla _{{\varvec{x}}}\varvec{\phi }^{-1} =(\nabla _{{\varvec{z}}}\varvec{\psi })\, {\mathsf{F}}^{-1}, \end{aligned}$$

because \({\mathsf{F}}=\nabla _{{\varvec{z}}}\varvec{\phi }({\varvec{z}})=\left( \frac{\partial x_i}{\partial z_j}\right) _{i,j}\) and \({\mathsf{F}}^{-1}=\left( \frac{\partial z_i}{\partial x_j}\right) _{i,j}=\nabla _{{\varvec{x}}} \varvec{\phi }^{-1}({\varvec{x}})\). The divergence of \(\widetilde{\varvec{\psi }}({\varvec{x}})\) is

$$\begin{aligned}&\nabla _{{\varvec{x}}} \cdot \widetilde{\varvec{\psi }}({\varvec{x}})=\sum _{i=1}^d \frac{\partial {\tilde{\psi }}_i}{\partial x_i}=\mathrm {tr}(\text {Jacobian of }\widetilde{\varvec{\psi }})\\&\quad =\mathrm {tr}(\nabla _{{\varvec{x}}} \widetilde{\varvec{\psi }})=\mathrm {tr}((\nabla _{{\varvec{z}}}\varvec{\psi }) {\mathsf{F}}^{-1})=\mathrm {tr}({\mathsf{F}}^{-1}\nabla _{{\varvec{z}}}\varvec{\psi })={\mathsf{F}}^{-\top }:\nabla _{{\varvec{z}}}\varvec{\psi }. \end{aligned}$$

Therefore, the first summand in (C.1) becomes,

$$\begin{aligned} -\int _{{\mathcal {X}}^0}\rho _0 ({\mathsf{F}}^{-\top } : \nabla _{{\varvec{z}}} \varvec{\psi })\,\mathrm {d}{\varvec{z}} =-\int _{{\mathcal {X}}^t}\rho ({\varvec{x}}) (\nabla _{{\varvec{x}}}\cdot \widetilde{\varvec{\psi }})\,\mathrm {d}{\varvec{x}}. \end{aligned}$$

By the divergence theorem,

$$\begin{aligned}&\int _{{\mathcal {X}}^t}\rho ({\varvec{x}}) (\nabla _{{\varvec{x}}}\cdot \widetilde{\varvec{\psi }})\,\mathrm {d}{\varvec{x}}+ \int _{{\mathcal {X}}^t}\widetilde{\varvec{\psi }}^\top (\nabla _{{\varvec{x}}}\rho ({\varvec{x}}))\,\mathrm {d}{\varvec{x}}\\&\quad =\oint _{\partial {\mathcal {X}}^t} \rho ({\varvec{x}})(\widetilde{\varvec{\psi }}\cdot \varvec{\nu })\,\mathrm {d}S=0, \end{aligned}$$

because of the boundary condition \(\widetilde{\varvec{\psi }}\cdot \varvec{\nu }=0\) on \(\partial {\mathcal {X}}^t\). Thus,

$$\begin{aligned}&-\int _{{\mathcal {X}}^t}\rho ({\varvec{x}}) (\nabla _{{\varvec{x}}}\cdot \widetilde{\varvec{\psi }})\,\mathrm {d}{\varvec{x}}\\&\quad =\int _{{\mathcal {X}}^t}\widetilde{\varvec{\psi }}^\top (\nabla _{{\varvec{x}}}\rho ({\varvec{x}}))\,\mathrm {d}{\varvec{x}}=\int _{{\mathcal {X}}^t}\nabla _{{\varvec{x}}}\rho ({\varvec{x}})\cdot \widetilde{\varvec{\psi }}\,\mathrm {d}{\varvec{x}}. \end{aligned}$$

Therefore, combining the two summands in \({\mathcal {X}}^t\), we have

$$\begin{aligned} \begin{aligned}&\frac{\mathrm {d}}{\mathrm {d}\epsilon } \Big |_{\epsilon = 0} \mathrm {KL}(\rho _{[\varvec{\phi }^{\epsilon }]} || \rho ^{*}) \\&\quad = \int _{{\mathcal {X}}^t} - \rho _{[\varvec{\phi }]} (\nabla _{{\varvec{x}}} \cdot \widetilde{ \varvec{\psi }}) + \rho \nabla V \cdot \widetilde{\varvec{\psi }} \mathrm {d}{\varvec{x}}\\&\quad = \int _{{\mathcal {X}}^t} (\nabla \rho + \rho \nabla V) \cdot \widetilde{\varvec{\psi }} \mathrm {d}{\varvec{x}}, \\ \end{aligned} \end{aligned}$$

which implies that

$$\begin{aligned} \frac{\delta \mathrm {KL}(\rho _{[\varvec{\phi }]} \,||\, \rho ^{*})}{\delta \varvec{\phi }} = \nabla \rho + \rho \nabla V. \end{aligned}$$
(C.2)

Recall that \(V = - \ln \rho ^{*}\). Notice that if \({\mathsf{F}}\) is the identity matrix, the result in (C.1) can be written as

$$\begin{aligned} - {\mathbb {E}}_{{\varvec{z}}\sim \rho _0} [\mathrm {trace} (\nabla _{{\varvec{z}}} \varvec{\psi } + \nabla \ln \rho ^{*} \varvec{\psi }^{\mathrm{T}})], \end{aligned}$$

which is exactly the form given by the Stein operator in Liu and Wang (2016).
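
For reference, the following is a minimal sketch (ours) of the SVGD update of Liu and Wang (2016), which moves particles along the kernelized form above; the RBF kernel with a fixed bandwidth \(h\) and the standard Gaussian target are simplifying assumptions (the reference implementation adapts \(h\) by a median heuristic).

```python
import numpy as np

# Minimal SVGD sketch (after Liu and Wang 2016). The RBF kernel with a
# fixed bandwidth h and the standard Gaussian target are simplifying
# assumptions; the reference implementation adapts h by a median heuristic.

def grad_log_target(x):                 # grad ln rho*(x) = -grad V(x)
    return -x                           # standard Gaussian target (assumption)

def svgd_step(X, eps=0.1, h=1.0):
    # phi(x_i) = (1/N) sum_j [ K(x_j, x_i) grad ln rho*(x_j)
    #                          + grad_{x_j} K(x_j, x_i) ]
    d2 = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
    K = np.exp(-d2 / (2 * h))                          # K[i, j] = K(x_i, x_j)
    grad_K = (X[None, :, :] - X[:, None, :]) * K[..., None] / h
    phi = (K @ grad_log_target(X) + grad_K.sum(axis=0)) / X.shape[0]
    return X + eps * phi

rng = np.random.default_rng(1)
X = rng.normal(3.0, 0.5, size=(200, 2))  # particles, initially off-target
for _ in range(500):
    X = svgd_step(X)
print(X.mean(axis=0), X.std(axis=0))     # roughly [0, 0] and [1, 1]
```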

Proof of Theorem 1

Proof

Let \({\mathbf {X}}\in {\mathbb {R}}^D\) be the vectorization of \(\{ {\varvec{x}}_i \}_{i = 1}^N\), that is,

$$\begin{aligned} {\mathbf {X}}= (x_1^{(1)},\ldots , x_N^{(1)}, \ldots , x_1^{(d)}, \ldots , x_N^{(d)} ), \end{aligned}$$

where \(D = N \times d\). Recall that \(V({\varvec{x}}) = - \ln \rho ^{*}({\varvec{x}})\). For a sufficiently smooth target distribution \(\rho ^{*}({\varvec{x}})\), it is easy to show that

$$\begin{aligned} {\mathcal {F}}_h ( \{ {\varvec{x}}_i \}_{i=1}^{N}) = \frac{1}{N} \sum _{i=1}^N \left( \ln \left( \frac{1}{N} \sum _{j=1}^N K({\varvec{x}}_i, {\varvec{x}}_j) \right) + V({\varvec{x}}_i) \right) \end{aligned}$$

is continuous, coercive and bounded from below as a function of \({\mathbf {X}}\in {\mathbb {R}}^D\). We denote \({\mathcal {F}}_h ( \{ {\varvec{x}}_i \}_{i=1}^{N})\) by \({\mathcal {F}}_h ({\mathbf {X}})\).
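
To fix ideas, here is a short sketch (ours) of evaluating \({\mathcal {F}}_h\) on a particle set; the isotropic Gaussian kernel \(K\) with bandwidth \(h\) and the Gaussian target are illustrative choices.

```python
import numpy as np

# Sketch (ours) of the regularized free energy F_h defined above,
# with an isotropic Gaussian kernel K of bandwidth h (illustrative).

def F_h(X, V, h=0.1):
    """F_h = (1/N) sum_i [ ln( (1/N) sum_j K(x_i, x_j) ) + V(x_i) ]."""
    N, d = X.shape
    d2 = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
    K = np.exp(-d2 / (2 * h)) / (2 * np.pi * h)**(d / 2)   # Gaussian kernel
    return np.mean(np.log(K.mean(axis=1)) + V(X))

V = lambda X: 0.5 * np.sum(X**2, axis=1)   # V = -ln rho*, Gaussian target
X = np.random.default_rng(2).normal(size=(100, 2))
print(F_h(X, V))                           # finite, bounded from below in X
```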

For any given \(\{ {\varvec{x}}_i^n \}_{i=1}^N\), recall

$$\begin{aligned} J_n({\mathbf {X}}) = \frac{1}{2 \tau } \Vert {\mathbf {X}}- {\mathbf {X}}^n\Vert ^2_{{\mathbf {X}}} + {\mathcal {F}}_h({\mathbf {X}}), \end{aligned}$$

where \(\Vert \cdot \Vert _{{\mathbf {X}}}\) is the norm on \({\mathbb {R}}^D\) defined by

$$\begin{aligned} \Vert {\mathbf {X}}- {\mathbf {X}}^n\Vert ^2_{{\mathbf {X}}} = \frac{1}{N} \sum _{i=1}^N \Vert {\varvec{x}}_i - {\varvec{x}}_i^n \Vert ^2. \end{aligned}$$

Since

$$\begin{aligned} {\mathcal {S}} = \{ {\mathbf {X}}\in {\mathbb {R}}^D : J_n({\mathbf {X}}) \le J_n({\mathbf {X}}^n) \} \end{aligned}$$

is a non-empty, bounded, and closed set, the coercivity and continuity of \({\mathcal {F}}_h({\mathbf {X}})\) imply that \(J_n({\mathbf {X}})\) admits a global minimizer \({\mathbf {X}}^{n+1}\) in \({\mathcal {S}}\). Since \({\mathbf {X}}^{n+1}\) is a global minimizer of \(J_n({\mathbf {X}})\), we have

$$\begin{aligned} \frac{1}{2 \tau } \Vert {\mathbf {X}}^{n+1} - {\mathbf {X}}^{n} \Vert ^2_{{\mathbf {X}}} + {\mathcal {F}}_h({\mathbf {X}}^{n+1}) \le {\mathcal {F}}_h({\mathbf {X}}^{n}), \end{aligned}$$

which gives us equation (3.19).
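
The update in the proof is implicit: each iterate minimizes \(J_n\). The sketch below (ours) implements one such proximal step; the kernel, target, and optimizer (BFGS via scipy) are illustrative choices, and the assertion mirrors the monotone decrease (3.19).

```python
import numpy as np
from scipy.optimize import minimize

# Sketch (ours) of the implicit (proximal) update in the proof:
#   X^{n+1} = argmin_X  ||X - X^n||_X^2 / (2*tau) + F_h(X),
# so F_h(X^{n+1}) <= F_h(X^n), i.e., (3.19), holds by construction.
# Kernel, target, and optimizer (BFGS) are illustrative choices.

def F_h(X, h=0.1):
    N, d = X.shape
    d2 = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
    K = np.exp(-d2 / (2 * h)) / (2 * np.pi * h)**(d / 2)
    V = 0.5 * np.sum(X**2, axis=1)             # V = -ln rho*, Gaussian target
    return np.mean(np.log(K.mean(axis=1)) + V)

def prox_step(Xn, tau=0.5):
    def J_n(x):
        X = x.reshape(Xn.shape)
        prox = np.mean(np.sum((X - Xn)**2, axis=1)) / (2 * tau)
        return prox + F_h(X)
    return minimize(J_n, Xn.ravel(), method="BFGS").x.reshape(Xn.shape)

X = np.random.default_rng(3).normal(2.0, 0.3, size=(30, 2))
for _ in range(5):
    f_prev = F_h(X)
    X = prox_step(X)
    assert F_h(X) <= f_prev + 1e-8             # energy decrease, cf. (3.19)
```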

For the sequence \(\{ {\mathbf {X}}^n \}\), since

$$\begin{aligned} \Vert {\mathbf {X}}^{k} - {\mathbf {X}}^{k-1} \Vert ^2_{{\mathbf {X}}} \le 2 \tau ({\mathcal {F}}_h({\mathbf {X}}^{k-1}) - {\mathcal {F}}_h({\mathbf {X}}^{k})), \end{aligned}$$

we have

$$\begin{aligned} \sum _{k=1}^n \Vert {\mathbf {X}}^{k} - {\mathbf {X}}^{k-1} \Vert ^2_{{\mathbf {X}}} \le 2 \tau ({\mathcal {F}}_h({\mathbf {X}}^{0}) - {\mathcal {F}}_h({\mathbf {X}}^{n})) \le C, \end{aligned}$$

for some constant \(C\) that is independent of \(n\). Hence,

$$\begin{aligned} \lim _{n\rightarrow \infty } \Vert {\mathbf {X}}^{n} - {\mathbf {X}}^{n-1} \Vert _{{\mathbf {X}}} = 0, \end{aligned}$$

which indicates the convergence of \(\{ {\mathbf {X}}^n \}\). Moreover, since

$$\begin{aligned} {\mathbf {X}}^{n} = {\mathbf {X}}^{n-1} - \tau \nabla _{{\mathbf {X}}} {\mathcal {F}}_h({\mathbf {X}}^{n}), \end{aligned}$$

we have

$$\begin{aligned} \lim _{n \rightarrow \infty } \nabla _{{\mathbf {X}}} {\mathcal {F}}_h({\mathbf {X}}^{n}) = 0, \end{aligned}$$

so \(\{{\mathbf {X}}^n\}\) converges to a stationary point of \({\mathcal {F}}_h({\mathbf {X}})\). \(\square \)

Cite this article

Wang, Y., Chen, J., Liu, C. et al. Particle-based energetic variational inference. Stat Comput 31, 34 (2021). https://doi.org/10.1007/s11222-021-10009-7
