Abstract
We introduce a new variational inference (VI) framework, called energetic variational inference (EVI). It minimizes the VI objective functional based on a prescribed energy-dissipation law. Using the EVI framework, we can derive many existing particle-based variational inference (ParVI) methods, including the popular Stein variational gradient descent (SVGD). More importantly, many new ParVI schemes can be created under this framework. For illustration, we propose a new particle-based EVI scheme, which performs the particle-based approximation of the density first and then uses the approximated density in the variational procedure, or “Approximation-then-Variation” for short. Thanks to this order of approximation and variation, the new scheme can maintain the variational structure at the particle level, and can significantly decrease the KL-divergence in each iteration. Numerical experiments show that the proposed method outperforms some existing ParVI methods in terms of fidelity to the target distribution.
References
Ali, S.M., Silvey, S.D.: A general class of coefficients of divergence of one distribution from another. J. R. Stat. Soc. Ser. B 28(1), 131–142 (1966)
Ambrosio, L., Lisini, S., Savaré, G.: Stability of flows associated to gradient vector fields and convergence of iterated transport maps. Manuscr. Math. 121(1), 1–50 (2006)
Arbel, M., Korba, A., Salim, A., Gretton, A.: Maximum mean discrepancy gradient flow. In: Advances in Neural Information Processing Systems, pp. 6484–6494 (2019)
Barzilai, J., Borwein, J.M.: Two-point step size gradient methods. IMA J. Numer. Anal. 8(1), 141–148 (1988)
Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, New York (2006)
Blei, D.M., Kucukelbir, A., McAuliffe, J.D.: Variational inference: a review for statisticians. J. Am. Stat. Assoc. 112(518), 859–877 (2017)
Carrillo, J.A., Lisini, S.: On the asymptotic behavior of the gradient flow of a polyconvex functional. Nonlinear Partial Differ. Equ. Hyperbolic Wave Phenom. 526, 37–51 (2010)
Carrillo, J.A., Düring, B., Matthes, D., McCormick, D.S.: A Lagrangian scheme for the solution of nonlinear diffusion equations using moving simplex meshes. J. Sci. Comput. 75(3), 1463–1499 (2018)
Carrillo, J.A., Craig, K., Patacchini, F.S.: A blob method for diffusion. Calc. Var. Partial. Differ. Equ. 58(2), 53 (2019)
Casella, G., George, E.I.: Explaining the Gibbs sampler. Am. Stat. 46(3), 167–174 (1992)
Chen, C., Zhang, R., Wang, W., Li, B., Chen, L.: A unified particle-optimization framework for scalable Bayesian sampling (2018). arXiv preprint arXiv:1805.11659
Chen, P., Wu, K., Chen, J., O’Leary-Roseberry, T., Ghattas, O.: Projected stein variational Newton: a fast and scalable Bayesian inference method in high dimensions (2019). arXiv preprint arXiv:1901.08659
Dai, B., He, N., Dai, H., Song, L.: Provable Bayesian inference via particle mirror descent. In: Artificial Intelligence and Statistics, pp. 985–994 (2016)
Degond, P., Mustieles, F.J.: A deterministic approximation of diffusion equations using particles. SIAM J. Sci. Comput. 11(2), 293–310 (1990)
Detommaso, G., Cui, T., Marzouk, Y., Spantini, A., Scheichl, R.: A Stein variational Newton method. In: Advances in Neural Information Processing Systems, pp. 9169–9179 (2018)
Du, Q., Feng, X.: The phase field method for geometric moving interfaces and their numerical approximations (2019). arXiv preprint arXiv:1902.04924
Duane, S., Kennedy, A.D., Pendleton, B.J., Roweth, D.: Hybrid Monte Carlo. Phys. Lett. B 195(2), 216–222 (1987)
Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011)
El Moselhy, T.A., Marzouk, Y.M.: Bayesian inference with optimal maps. J. Comput. Phys. 231(23), 7815–7850 (2012)
Evans, L.C., Savin, O., Gangbo, W.: Diffeomorphisms and nonlinear heat flows. SIAM J. Math. Anal. 37(3), 737–751 (2005)
Francois, D., Wertz, V., Verleysen, M.: About the locality of kernels in high-dimensional spaces. In: International Symposium on Applied Stochastic Models and Data Analysis, pp. 238–245 (2005)
Frogner, C., Poggio, T.: Approximate inference with Wasserstein gradient flows (2018). arXiv preprint arXiv:1806.04542
Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A., Rubin, D.B.: Bayesian Data Analysis. Chapman and Hall/CRC, Boca Raton (2013)
Geman, S., Geman, D.: Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 6, 721–741 (1984)
Gershman, S.J., Hoffman, M.D., Blei, D.M.: Nonparametric variational inference. In: Proceedings of the 29th International Conference on Machine Learning, pp. 235–242 (2012)
Giga, M.H., Kirshtein, A., Liu, C.: Variational modeling and complex fluids. In: Handbook of Mathematical Analysis in Mechanics of Viscous Fluids, pp. 1–41. Springer (2017)
Gonzalez, O., Stuart, A.M.: A First Course in Continuum Mechanics. Cambridge University Press, Cambridge (2008)
Haario, H., Saksman, E., Tamminen, J.: Adaptive proposal distribution for random walk Metropolis algorithm. Comput. Stat. 14(3), 375–396 (1999)
Hastings, W.K.: Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57(1), 97–109 (1970)
Hohenberg, P.C., Halperin, B.I.: Theory of dynamic critical phenomena. Rev. Mod. Phys. 49(3), 435 (1977)
Iserles, A.: A First Course in the Numerical Analysis of Differential Equations, Cambridge Texts in Applied Mathematics, No. 44. Cambridge University Press, New York (2009)
Jordan, R., Kinderlehrer, D., Otto, F.: The variational formulation of the Fokker–Planck equation. SIAM J. Math. Anal. 29(1), 1–17 (1998)
Jordan, M.I., Ghahramani, Z., Jaakkola, T.S., Saul, L.K.: An introduction to variational methods for graphical models. Mach. Learn. 37(2), 183–233 (1999)
Kingma, D.P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., Welling, M.: Improved variational inference with inverse autoregressive flow. In: Advances in Neural Information Processing Systems, pp. 4743–4751 (2016)
Lacombe, G., Mas-Gallic, S.: Presentation and analysis of a diffusion-velocity method. In: ESAIM: Proceedings, vol. 7, pp. 225–233. EDP Sciences (1999)
Li, L., Liu, J.G., Liu, Z., Lu, J.: A stochastic version of Stein variational gradient descent for efficient sampling (2019). arXiv preprint arXiv:1902.03394
Liu, C.: An introduction of elastic complex fluids: an energetic variational approach. In: Multi-Scale Phenomena in Complex Fluids: Modeling, Analysis and Numerical Simulation, pp. 286–337. World Scientific (2009)
Liu, Q.: Stein variational gradient descent as gradient flow. In: Advances in Neural Information Processing Systems, pp. 3115–3123 (2017)
Liu, Q., Wang, D.: Stein variational gradient descent: a general purpose Bayesian inference algorithm. In: Advances in Neural Information Processing Systems, pp. 2378–2386 (2016)
Liu, C., Wang, Y.: On Lagrangian schemes for porous medium type generalized diffusion equations: a discrete energetic variational approach. J. Comput. Phys. 417, 109566 (2020a)
Liu, C., Wang, Y.: A variational Lagrangian scheme for a phase field model: a discrete energetic variational approach. SIAM J. Sci. Comput. 42(6), B1541–B1569 (2020b)
Liu, C., Zhu, J.: Riemannian Stein variational gradient descent for Bayesian inference. In: 32nd AAAI Conference on Artificial Intelligence (2018)
Liu, C., Zhuo, J., Cheng, P., Zhang, R., Zhu, J.: Understanding and accelerating particle-based variational inference. In: International Conference on Machine Learning, pp. 4082–4092 (2019)
Lu, J., Lu, Y., Nolen, J.: Scaling limit of the Stein variational gradient descent: the mean field regime. SIAM J. Math. Anal. 51(2), 648–671 (2019)
MacKay, D.J.C.: Information Theory, Inference and Learning Algorithms. Cambridge University Press, Cambridge (2003)
Matthes, D., Plazotta, S.: A variational formulation of the BDF2 method for metric gradient flows. ESAIM: Math. Model. Numer. Anal. 53(1), 145–172 (2019)
Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., Teller, E.: Equation of state calculations by fast computing machines. J. Chem. Phys. 21(6), 1087–1092 (1953)
Mika, S., Rätsch, G., Weston, J., Schölkopf, B., Müllers, K.R.: Fisher discriminant analysis with kernels. In: Neural Networks for Signal Processing IX: Proceedings of the 1999 IEEE Signal Processing Society Workshop, pp. 41–48. IEEE (1999)
Murphy, K.P.: Machine Learning: A Probabilistic Perspective. MIT Press, Cambridge (2012)
Neal, R.M.: Probabilistic Inference Using Markov Chain Monte Carlo Methods. Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto (1993)
Neal, R.M., Hinton, G.E.: A view of the EM algorithm that justifies incremental, sparse, and other variants. In: Learning in Graphical Models, pp. 355–368. Springer (1998)
Onsager, L.: Reciprocal relations in irreversible processes. I. Phys. Rev. 37(4), 405 (1931a)
Onsager, L.: Reciprocal relations in irreversible processes. II. Phys. Rev. 38(12), 2265 (1931b)
Papamakarios, G., Nalisnick, E., Rezende, D.J., Mohamed, S., Lakshminarayanan, B.: Normalizing flows for probabilistic modeling and inference (2019). arXiv preprint arXiv:1912.02762
Parisi, G.: Correlation functions and computer simulations. Nucl. Phys. B 180(3), 378–384 (1981)
Rayleigh, L.: Note on the numerical calculation of the roots of fluctuating functions. Proc. Lond. Math. Soc. 1(1), 119–124 (1873)
Rezende, D.J., Mohamed, S.: Variational inference with normalizing flows (2015). arXiv preprint arXiv:1505.05770
Roberts, G.O., Tweedie, R.L.: Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli 2(4), 341–363 (1996)
Rockafellar, R.T.: Monotone operators and the proximal point algorithm. SIAM J. Control Optim. 14(5), 877–898 (1976)
Rossky, P.J., Doll, J.D., Friedman, H.L.: Brownian dynamics as smart Monte Carlo simulation. J. Chem. Phys. 69(10), 4628–4633 (1978)
Salimans, T., Kingma, D., Welling, M.: Markov chain Monte Carlo and variational inference: bridging the gap. In: International Conference on Machine Learning, pp. 1218–1226 (2015)
Salman, H., Yadollahpour, P., Fletcher, T., Batmanghelich, K.: Deep diffeomorphic normalizing flows (2018). arXiv preprint arXiv:1810.03256
Santambrogio, F.: {Euclidean, metric, and Wasserstein} gradient flows: an overview. Bull. Math. Sci. 7(1), 87–154 (2017)
Sonoda, S., Murata, N.: Transport analysis of infinitely deep neural network. J. Mach. Learn. Res. 20(1), 31–82 (2019)
Stuart, A.M.: Inverse problems: a Bayesian perspective. Acta Numer. 19, 451–559 (2010)
Tabak, E.G., Vanden-Eijnden, E.: Density estimation by dual ascent of the log-likelihood. Commun. Math. Sci. 8(1), 217–233 (2010)
Temam, R., Miranville, A.: Mathematical Modeling in Continuum Mechanics. Cambridge University Press, Cambridge (2005)
Villani, C.: Optimal Transport: Old and New, vol. 338. Springer, Berlin (2008)
Wainwright, M.J., Jordan, M.I.: Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn. 1(1–2), 1–305 (2008)
Wang, D., Tang, Z., Bajaj, C., Liu, Q.: Stein variational gradient descent with matrix-valued kernels. In: Advances in Neural Information Processing Systems, pp. 7834–7844 (2019)
Welling, M., Teh, Y.W.: Bayesian learning via stochastic gradient Langevin dynamics. In: Proceedings of the 28th International Conference on Machine Learning, pp. 681–688 (2011)
Acknowledgements
Y. Wang and C. Liu are partially supported by the National Science Foundation Grant DMS-1759536. L. Kang is partially supported by the National Science Foundation Grant DMS-1916467.
Appendices
Energetic variational approach
In this appendix, we give a brief introduction to the energetic variational approach. We refer interested readers to Liu (2009) and Giga et al. (2017) for a more comprehensive description.
As mentioned previously, the energetic variational approach provides a paradigm to determine the dynamics of a dissipative system from a prescribed energy-dissipation law, which shifts the main task in the modeling of a dynamic system to the construction of the energy-dissipation law. In physics, an energy-dissipation law, yielded by the first and second laws of thermodynamics (Giga et al. 2017), is often given by
\[ \frac{\mathrm {d}}{\mathrm {d}t} \left( {\mathcal {K}}[{\varvec{\phi }}] + {\mathcal {F}}[{\varvec{\phi }}] \right) = - 2 {\mathcal {D}}[{\varvec{\phi }}, \dot{\varvec{\phi }}], \qquad \mathrm {(A.1)} \]
where \({\varvec{\phi }}\) is the state variable, \({\mathcal {K}}\) is the kinetic energy, \({\mathcal {F}}\) is the Helmholtz free energy, and \(2 {\mathcal {D}} \ge 0\) is the rate of energy dissipation. If \({\mathcal {K}} = 0\), one can view (A.1) as a generalization of gradient flow (Hohenberg and Halperin 1977).
The Least Action Principle states that the equation of motion for a Hamiltonian system can be derived from the variation of the action functional \({\mathcal {A}} = \int _{0}^T ({\mathcal {K}} - {\mathcal {F}})\, \mathrm {d}t\) with respect to \({\varvec{\phi }}({\varvec{z}}, t)\) (the trajectory), i.e.,
\[ \delta {\mathcal {A}} = \int _0^T \!\! \int (f_{\mathrm {inertial}} - f_{\mathrm {conv}}) \cdot \delta {\varvec{\phi }} \, \mathrm {d}{\varvec{z}} \, \mathrm {d}t. \]
This procedure yields the conservative forces of the system, that is, \( (f_{\mathrm{inertial}} - f_{\mathrm{conv}}) = \frac{\delta {\mathcal {A}}}{\delta {\varvec{\phi }}}\). Meanwhile, according to the maximum dissipation principle (MDP), the dissipative force can be obtained by minimizing the dissipation functional with respect to the “rate” \(\dot{\varvec{\phi }}\), i.e., \(f_{\mathrm{diss}} = \frac{\delta {\mathcal {D}}}{\delta \dot{\varvec{\phi }}}\). According to Newton’s second law (\(F = ma\)), we have the force balance condition \(f_{\mathrm{inertial}} = f_{\mathrm{conv}} + f_{\mathrm{diss}}\) (where \(f_{\mathrm{inertial}}\) plays the role of \(ma\)), which defines the dynamics of the system:
\[ \frac{\delta {\mathcal {A}}}{\delta {\varvec{\phi }}} = \frac{\delta {\mathcal {D}}}{\delta \dot{\varvec{\phi }}}. \qquad \mathrm {(A.2)} \]
In the case that \({\mathcal {K}} = 0\), we have
\[ - \frac{\delta {\mathcal {F}}}{\delta {\varvec{\phi }}} = \frac{\delta {\mathcal {D}}}{\delta \dot{\varvec{\phi }}}. \qquad \mathrm {(A.3)} \]
Notice that the free energy is decreasing in time when \({\mathcal {K}} = 0\). As an analogy, if we take the VI objective functional as \({\mathcal {F}}\), (A.1) gives a continuous mechanism to decrease the free energy, and (A.2) or (A.3) gives the equation for \({\varvec{\phi }}({\varvec{z}}, t)\).
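To make the analogy concrete, the following sketch (our notation, with \(V = -\ln \rho ^{*}\) and \(\rho _{[\varvec{\phi }]}\) the push-forward density, as in the main text) takes the VI objective as the free energy:

```latex
% Sketch: KL divergence as Helmholtz free energy (case \mathcal{K} = 0).
\mathcal{F}[\rho_{[\varvec{\phi}]}]
  = \mathrm{KL}\!\left(\rho_{[\varvec{\phi}]} \,\middle|\, \rho^{*}\right)
  = \int \rho_{[\varvec{\phi}]} \ln \rho_{[\varvec{\phi}]} \,\mathrm{d}\varvec{x}
  + \int \rho_{[\varvec{\phi}]}\, V \,\mathrm{d}\varvec{x},
\qquad
\frac{\mathrm{d}}{\mathrm{d}t}\,\mathcal{F}[\rho_{[\varvec{\phi}]}]
  = -2\mathcal{D} \le 0 .
```

so the KL divergence decreases monotonically along the flow determined by (A.3).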
Derivation of the transport equation
The transport equation can be derived directly from the conservation of probability mass. Let
\[ {\mathsf{F}}({\varvec{z}}, t) = \nabla _{{\varvec{z}}} \varvec{\phi }({\varvec{z}}, t) \]
be the deformation tensor associated with the flow map \(\varvec{\phi }({\varvec{z}}, t)\), i.e., the Jacobian matrix of \(\varvec{\phi }\). Then, due to the conservation of probability mass, we have
\[ \rho (\varvec{\phi }({\varvec{z}}, t), t) \det {\mathsf{F}}({\varvec{z}}, t) = \rho _0({\varvec{z}}), \]
which, after differentiating both sides with respect to \(t\) and using \(\frac{\mathrm {d}}{\mathrm {d}t} \det {\mathsf{F}} = \det {\mathsf{F}} \, ({\mathsf{F}}^{-\top } : \dot{{\mathsf{F}}})\), implies that
\[ \partial _t \rho + \nabla \cdot (\rho {\varvec{u}}) = 0, \qquad {\varvec{u}}(\varvec{\phi }({\varvec{z}}, t), t) = \dot{\varvec{\phi }}({\varvec{z}}, t). \]
Here the operation “:”, \({\mathsf{A}} : {\mathsf{B}} = \sum _{i} \sum _{j} A_{ij} B_{ij}\), is the Frobenius inner product between two matrices \({\mathsf{A}}, {\mathsf{B}} \in {\mathbb {R}}^{n \times m}\).
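The Frobenius inner product enters when differentiating \(\det {\mathsf{F}}\) in time through Jacobi's formula, \(\frac{\mathrm {d}}{\mathrm {d}t} \det {\mathsf{F}} = \det {\mathsf{F}} \, ({\mathsf{F}}^{-\top } : \dot{{\mathsf{F}}})\). A small numerical sanity check of this identity (illustrative code, not part of the paper):

```python
# Check Jacobi's formula d/dt det F = det(F) * (F^{-T} : F_dot) numerically.
import numpy as np

rng = np.random.default_rng(0)
F = np.eye(3) + 0.1 * rng.standard_normal((3, 3))   # deformation tensor F(0)
F_dot = rng.standard_normal((3, 3))                  # its time derivative

def frobenius(A, B):
    """Frobenius inner product A : B = sum_ij A_ij B_ij."""
    return float(np.sum(A * B))

lhs = np.linalg.det(F) * frobenius(np.linalg.inv(F).T, F_dot)

# Central finite difference of t -> det(F + t * F_dot) at t = 0.
eps = 1e-6
rhs = (np.linalg.det(F + eps * F_dot) - np.linalg.det(F - eps * F_dot)) / (2 * eps)

assert abs(lhs - rhs) < 1e-6
```

The same identity underlies Lagrangian discretizations of generalized diffusions (Liu and Wang 2020a).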
Computation of Equation (2.10)
In this part, we give a detailed derivation of the variation of \(\mathrm {KL}(\rho _{[\varvec{\phi }]} | \rho ^{*})\) with respect to the flow map \(\varvec{\phi }({\varvec{z}}): {\mathcal {X}}^0 \rightarrow {\mathcal {X}}^t\). Consider a small perturbation of \(\varvec{\phi }\),
\[ \varvec{\phi }^{\epsilon }({\varvec{z}}) = \varvec{\phi }({\varvec{z}}) + \epsilon \varvec{\psi }({\varvec{z}}), \]
where \(\varvec{\psi }({\varvec{z}}) = \widetilde{\varvec{\psi }}(\varvec{\phi }({\varvec{z}}))\) is a smooth map satisfying
\[ \widetilde{\varvec{\psi }} \cdot \varvec{\nu } = 0 \quad \text {on } \partial {\mathcal {X}}^t, \]
with \(\varvec{\nu }\) being the outward-pointing unit normal on the boundary \(\partial {\mathcal {X}}^t\). Thus, \(\widetilde{\varvec{\psi }}=\varvec{\psi }(\varvec{\phi }^{-1}({\varvec{x}}))\), and the above condition indicates that \(\widetilde{\varvec{\psi }}\) vanishes at the boundary of \({\mathcal {X}}^t\). For \({\mathcal {X}}^0 = {\mathcal {X}}^t = {\mathbb {R}}^d\), \(\widetilde{\varvec{\psi }} \in C_0^{\infty } ({\mathbb {R}}^d)\). We denote by \({\mathsf{F}}\) the Jacobian matrix of \(\varvec{\phi }\), i.e., \(F_{ij}=\frac{\partial \phi _i}{\partial z_j}\), and by \({\mathsf{F}}^{\epsilon }\) the Jacobian matrix of \(\varvec{\phi }^{\epsilon }\), i.e.,
\[ {\mathsf{F}}^{\epsilon } = {\mathsf{F}} + \epsilon \nabla _{{\varvec{z}}} \varvec{\psi }. \]
Then we have
For two matrices of the same size, define \(\mathsf{A}:\mathsf{B}=\sum _i\sum _j A_{ij}B_{ij}=\mathrm {tr}(\mathsf{A}^\top \mathsf{B})\). Since
we have
Hence, we have the last result in (C.1).
Based on the definition of \(\varvec{\phi }\), we have the following.
The second summand of (C.1) becomes
Now, we investigate the first summand in (C.1). Based on the definition of \(\widetilde{\varvec{\psi }}\), we can see that
because \({\mathsf{F}}=\nabla _{{\varvec{z}}}\varvec{\phi }({\varvec{z}})=\left( \frac{\partial x_i}{\partial z_j}\right) _{i,j}\) and \({\mathsf{F}}^{-1}=\left( \frac{\partial z_i}{\partial x_j}\right) _{i,j}=\nabla _{{\varvec{x}}} \varvec{\phi }^{-1}({\varvec{x}})\). The divergence of \(\widetilde{\varvec{\psi }}({\varvec{x}})\) is
Therefore, the first summand in (C.1) becomes,
By the divergence theorem,
\[ \int _{{\mathcal {X}}^t} \nabla _{{\varvec{x}}} \cdot \left( \rho \, \widetilde{\varvec{\psi }} \right) \mathrm {d}{\varvec{x}} = \int _{\partial {\mathcal {X}}^t} \rho \, \widetilde{\varvec{\psi }} \cdot \varvec{\nu } \, \mathrm {d}S = 0, \]
because the boundary condition \(\widetilde{\varvec{\psi }}\cdot \varvec{\nu }=0\) holds on \(\partial {\mathcal {X}}^t\). So
Therefore, in \({\mathcal {X}}^t\), by performing integration by parts, we have
which implies that
Recall \(V = - \ln \rho ^{*}\). One can notice that if \({\mathsf{F}}\) is the identity matrix, the result in (C.1) can be written as
which is exactly the form given by the Stein operator in Liu and Wang (2016).
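For reference, the displayed computations above can be summarized as follows (our reconstruction from the definitions in this appendix, with \(\rho _0\) denoting the initial density; signs follow \(V = -\ln \rho ^{*}\)):

```latex
% Variation of the KL divergence with respect to the flow map (sketch).
\frac{\mathrm{d}}{\mathrm{d}\epsilon}
  \mathrm{KL}\!\left(\rho_{[\varvec{\phi}^{\epsilon}]} \,\middle|\, \rho^{*}\right)
  \Big|_{\epsilon = 0}
= \int \rho_0(\varvec{z})
    \left[ \nabla V(\varvec{\phi}(\varvec{z})) \cdot \varvec{\psi}(\varvec{z})
         - \mathsf{F}^{-\top} : \nabla_{\varvec{z}} \varvec{\psi}(\varvec{z})
    \right] \mathrm{d}\varvec{z}
= \int \rho_{[\varvec{\phi}]}(\varvec{x})
    \left[ \nabla V(\varvec{x}) \cdot \widetilde{\varvec{\psi}}(\varvec{x})
         - \nabla_{\varvec{x}} \cdot \widetilde{\varvec{\psi}}(\varvec{x})
    \right] \mathrm{d}\varvec{x} .
```

The second equality uses the change of variables \({\varvec{x}} = \varvec{\phi }({\varvec{z}})\) together with \({\mathsf{F}}^{-\top } : \nabla _{{\varvec{z}}} \varvec{\psi } = \nabla _{{\varvec{x}}} \cdot \widetilde{\varvec{\psi }}\); the negation of the last integrand is the Stein operator of Liu and Wang (2016) applied to \(\widetilde{\varvec{\psi }}\).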
Proof of Theorem 1
Proof
Let \({\mathbf {X}}\in {\mathbb {R}}^D\) be vectorized \(\{ {\varvec{x}}_i \}_{i = 1}^N\), that is
where \(D = N \times d\). Recall that \(V({\varvec{x}}) = - \ln \rho ^{*}({\varvec{x}})\). For a sufficiently smooth target distribution \(\rho ^{*}({\varvec{x}})\), it is easy to show that
is continuous, coercive and bounded from below as a function of \({\mathbf {X}}\in {\mathbb {R}}^D\). We denote \({\mathcal {F}}_h ( \{ {\varvec{x}}_i \}_{i=1}^{N})\) by \({\mathcal {F}}_h ({\mathbf {X}})\).
For any given \(\{ {\varvec{x}}_i^n \}_{i=1}^N\), recall
where \(\Vert \cdot \Vert _{{\mathbf {X}}}\) is a norm on \({\mathbb {R}}^D\), defined by
Since
is a non-empty, bounded, and closed set, by the coerciveness and continuity of \({\mathcal {F}}_h({\mathbf {X}})\), \(J_n({\mathbf {X}})\) admits a global minimizer \({\mathbf {X}}^{n+1}\) in \({\mathcal {S}}\). Since \({\mathbf {X}}^{n+1}\) is a global minimizer of \(J_n({\mathbf {X}})\), we have
which gives us equation (3.19).
For the sequence \(\{ {\mathbf {X}}^n \}\), since
we have
for some constant \(C\) that is independent of \(n\). Hence
which indicates the convergence of \(\{ {\mathbf {X}}^n \}\). Moreover, since
we have
so \(\{{\mathbf {X}}^n\}\) converges to a stationary point of \({\mathcal {F}}_h({\mathbf {X}})\). \(\square \)
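The proximal (implicit Euler) step in the proof can be illustrated numerically. The sketch below is ours, not the paper's code: `F_h` is a toy smooth, coercive energy standing in for \({\mathcal {F}}_h\), and each step approximately minimizes \(J_n({\mathbf {X}}) = F_h({\mathbf {X}}) + \Vert {\mathbf {X}} - {\mathbf {X}}^n \Vert ^2 / (2\tau )\) by gradient descent:

```python
# Toy proximal iteration X^{n+1} = argmin_Y F_h(Y) + ||Y - X^n||^2 / (2 tau).
import numpy as np

def F_h(X):
    """Toy stand-in free energy: smooth, coercive, bounded from below."""
    return float(np.sum(X**2) + np.sum(np.cos(X)))

def grad_F_h(X):
    return 2 * X - np.sin(X)

def prox_step(Xn, tau, lr=0.05, iters=200):
    """Approximately minimize J_n(Y) = F_h(Y) + ||Y - Xn||^2 / (2 tau)."""
    Y = Xn.copy()
    for _ in range(iters):
        Y -= lr * (grad_F_h(Y) + (Y - Xn) / tau)
    return Y

tau = 0.5
X = np.array([2.0, -1.5, 0.7])           # X^0
energies = [F_h(X)]
for _ in range(50):
    X = prox_step(X, tau)                # X^{n+1}
    energies.append(F_h(X))

# Minimizing J_n forces F_h(X^{n+1}) + ||X^{n+1} - X^n||^2 / (2 tau) <= F_h(X^n),
# so the discrete energy decays and the iterates approach a stationary point.
assert all(e1 <= e0 + 1e-12 for e0, e1 in zip(energies, energies[1:]))
assert np.linalg.norm(grad_F_h(X)) < 1e-3
```

The energy-decay assertion mirrors the role of (3.19) in the proof: the descent property comes from the minimization itself, not from any step-size restriction.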
Cite this article
Wang, Y., Chen, J., Liu, C. et al. Particle-based energetic variational inference. Stat Comput 31, 34 (2021). https://doi.org/10.1007/s11222-021-10009-7