Mirror variational transport: a particle-based algorithm for distributional optimization on constrained domains

Nguyen, Dai Hai; Sakurai, Tetsuya

doi:10.1007/s10994-023-06350-9

Published: 27 June 2023

Volume 112, pages 2845–2869, (2023)
Machine Learning

Abstract

We consider the optimization problem of minimizing an objective functional, which admits a variational form and is defined over probability distributions on a constrained domain, which poses challenges to both theoretical analysis and algorithmic design. We propose Mirror Variational Transport (mirrorVT), which uses a set of samples, or particles, to represent the approximating distribution and deterministically updates the particles to optimize the functional. To deal with the constrained domain, in each iteration, mirrorVT maps the particles to an unconstrained dual domain, induced by a mirror map, and then approximately performs Wasserstein Gradient Descent on the manifold of distributions defined over the dual space to update each particle by a specified direction. At the end of each iteration, particles are mapped back to the original constrained domain. Through experiments on synthetic and real world data sets, we demonstrate the effectiveness of mirrorVT for the distributional optimization on the constrained domain. We also analyze its theoretical properties and characterize its convergence to the global minimum of the objective functional.

Code availability

The simulated data and source code for experiments can be accessed through https://github.com/haidnguyen0909/mirrorVT after the acceptance of the paper.

Notes

The 1-Wasserstein distance is defined as: $\mathscr {W}_{1}(p, q)= \inf _{\pi \in \Pi (p,q)} \int _{\mathscr {X}\times \mathscr {X}} \Vert {\textbf {x}} - {\textbf {x}}^\prime \Vert _{2}\textrm{d}\pi ({\textbf {x}},{\textbf {x}}^\prime )$.
We use [N] to indicate the list $\left[ 1,2,\ldots ,N\right]$ throughout the rest of the paper.
For any ${\textbf {x}}{}, {\textbf {x}}^\prime \in \mathscr {X}$ and ${\textbf {y}}=\nabla \varphi ({\textbf {x}}),{\textbf {y}}^\prime =\nabla \varphi ({\textbf {x}}^\prime )$, we have: $\Vert \nabla g^{*}_{t}({\textbf {y}})-\nabla g^{*}_{t}({\textbf {y}}^\prime ) \Vert _{2} = \Vert \nabla ^{2} \varphi ({\textbf {x}})^{-1}\nabla f^{*}_{t}({\textbf {x}})-\nabla ^{2} \varphi ({\textbf {x}}^\prime )^{-1}\nabla f^{*}_{t}({\textbf {x}}^\prime )\Vert _{2} \le h \Vert {\textbf {x}}-{\textbf {x}}^{\prime } \Vert _{2} = h \Vert \nabla \varphi ^{*}({\textbf {y}})-\nabla \varphi ^{*}({\textbf {y}}^\prime ) \Vert _{2} \le h/\alpha \Vert {\textbf {y}}-{\textbf {y}}^\prime \Vert _{2}$, where the last inequality holds as $\varphi ^{*}$ is $1/\alpha$-smooth.

References

Ahn, K., & Chewi, S. (2021). Efficient constrained sampling via the mirror-Langevin algorithm. Advances in Neural Information Processing Systems, 34, 28405–28418.
Arjovsky, M., Chintala, S., & Bottou, L. (2017). Wasserstein generative adversarial networks. In International conference on machine learning (pp. 214–223). PMLR.
Beck, A., & Teboulle, M. (2003). Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3), 167–175.
Bowman, S. R., Vilnis, L., Vinyals, O., Dai, A. M., Jozefowicz, R., & Bengio, S. (2015). Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349
Cheng, X., & Bartlett, P. (2018). Convergence of Langevin MCMC in KL-divergence. In Algorithmic learning theory (pp. 186–211). PMLR.
Cuturi, M. (2013). Sinkhorn distances: Lightspeed computation of optimal transport. Advances in Neural Information Processing Systems, 26.
Duchi, J., Shalev-Shwartz, S., Singer, Y., & Chandra, T. (2008). Efficient projections onto the $\ell _{1}$-ball for learning in high dimensions. In Proceedings of the 25th international conference on machine learning (pp. 272–279).
Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., & Smola, A. (2012). A kernel two-sample test. The Journal of Machine Learning Research, 13(1), 723–773.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
Hsieh, Y.-P., Kavis, A., Rolland, P., & Cevher, V. (2018). Mirrored Langevin dynamics. Advances in Neural Information Processing Systems, 31.
Joo, W., Lee, W., Park, S., & Moon, I.-C. (2020). Dirichlet variational autoencoder. Pattern Recognition, 107, 107514.
Kingma, D. P., & Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114
Koziel, S., & Michalewicz, Z. (1998). A decoder-based evolutionary algorithm for constrained parameter optimization problems. In Parallel problem solving from nature-PPSN V: 5th International conference Amsterdam, 1998 Proceedings (Vol. 5, pp. 231–240). Springer.
Liu, L., Zhang, Y., Yang, Z., Babanezhad, R., & Wang, Z. (2021). Infinite-dimensional optimization for zero-sum games via variational transport. In International conference on machine learning (pp. 7033–7044). PMLR.
Liu, Q., & Wang, D. (2016). Stein variational gradient descent: A general purpose Bayesian inference algorithm. Advances in Neural Information Processing Systems, 29.
Ma, Y.-A., Chen, T., & Fox, E. (2015). A complete recipe for stochastic gradient MCMC. Advances in Neural Information Processing Systems, 28.
Michalewicz, Z., & Schoenauer, M. (1996). Evolutionary algorithms for constrained parameter optimization problems. Evolutionary Computation, 4(1), 1–32.
Nguyen, D. H., Nguyen, C. H., & Mamitsuka, H. (2021). Learning subtree pattern importance for Weisfeiler–Lehman based graph kernels. Machine Learning, 110, 1585–1607.
Nguyen, D. H., & Tsuda, K. (2023). On a linear fused Gromov–Wasserstein distance for graph structured data. Pattern Recognition (p. 109351).
Rosasco, L., Belkin, M., & De Vito, E. (2009). A note on learning with integral operators. In COLT. Citeseer.
Santambrogio, F. (2017). {Euclidean, metric, and Wasserstein} gradient flows: An overview. Bulletin of Mathematical Sciences, 7(1), 87–154.
Shi, J., Liu, C., & Mackey, L. (2021). Sampling with mirrored Stein operators. arXiv preprint arXiv:2106.12506
Villani, C. et al. (2009). Optimal transport: Old and new (Vol. 338). Springer.
Welling, M., & Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th international conference on machine learning (ICML-11) (pp. 681–688).
Wibisono, A. (2018). Sampling as optimization in the space of measures: The Langevin dynamics as a composite optimization problem. In Conference on learning theory (pp. 2093–3027). PMLR.
Xu, P., Chen, J., Zou, D., & Gu, Q. (2018). Global convergence of Langevin dynamics based algorithms for nonconvex optimization. Advances in Neural Information Processing Systems, 31.
Zhang, H., & Sra, S. (2016). First-order methods for geodesically convex optimization. In Conference on learning theory (pp. 1617–1638). PMLR.

Funding

D. H. N. was supported by the Japan Society for the Promotion of Science (JSPS) KAKENHI Grant Number 23K16939. T. S. was supported by the New Energy and Industrial Technology Development Organization (NEDO) Grant Number JPNP18010 and Japan Science and Technology Agency (JST) Grant Number JPMJPF2017.

Author information

Authors and Affiliations

Department of Computer Science, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8577, Japan
Dai Hai Nguyen & Tetsuya Sakurai

Authors

Dai Hai Nguyen
Tetsuya Sakurai

Corresponding author

Correspondence to Dai Hai Nguyen.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Editors: Fabio Vitale, Tania Cerquitelli, Marcello Restelli, and Charalampos Tsourakakis.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix 1

1.1 Appendix 1.1: Proof of Theorem 2

Proof

We have assumed $F(p)=\text {KL}(p||p^{*})$ for $p,p^{*}\in \mathscr {P}_{2}(\mathscr {X})$, so $G(q)=\text {KL}(q||q^{*})$, for $q,q^{*}\in \mathscr {P}_{2}(\mathscr {Y})$. By the definition of the first variation of a functional, we have:

$$\begin{aligned} \frac{\textrm{d}}{\textrm{d}\epsilon } G(q + \epsilon \chi )\bigg |_{\epsilon =0}=\int _{\mathscr {Y}}\frac{\partial G}{\partial q}({\textbf {y}})\chi ({\textbf {y}})\textrm{d}{} {\textbf {y}} \text {, for all } \chi \in \mathscr {P}_{2}(\mathscr {Y}) \end{aligned}$$

We can compute the left-hand side as follows:

$$\begin{aligned} \frac{\textrm{d}}{\textrm{d}\epsilon } G(q + \epsilon \chi )\bigg |_{\epsilon =0}&=\frac{\textrm{d}}{\textrm{d}\epsilon } \text {KL}(q + \epsilon \chi || q^{*})\bigg |_{\epsilon =0}\\&= \frac{\textrm{d}}{\textrm{d}\epsilon }\int (q + \epsilon \chi )\log \left( \frac{q+\epsilon \chi }{q^{*}} \right) \textrm{d}{} {\textbf {y}}\bigg |_{\epsilon =0}\\&= \int \log \frac{q}{q^{*}}({\textbf {y}})\chi ({\textbf {y}})\textrm{d}{} {\textbf {y}} \end{aligned}$$

which indicates that $\partial G/\partial q= \log q - \log q^{*}$. For the t-th iteration, the update direction $v_{t}$ is given by:

$$\begin{aligned} \begin{aligned} v_{t}({\textbf {x}})&= \nabla ^{2}\varphi ({\textbf {x}})^{-1}\nabla f^{*}_{t}({\textbf {x}})=\nabla g^{*}_{t}({\textbf {y}})\\&= \nabla \log q_{t}({\textbf {y}}) - \nabla \log q^{*}({\textbf {y}}) \end{aligned} \end{aligned}$$

(30)

for all ${\textbf {x}}\in \mathscr {X}, {\textbf {y}}=\nabla \varphi ({\textbf {x}})\in \mathscr {Y}$. By applying the integral operator $\mathscr {L}_{k, p_{t}}$ (see Definition 1) to $v_{t}$, we obtain:

$$\begin{aligned} \begin{aligned} \mathscr {L}_{k, p_{t}} v_{t}({\textbf {x}})&= \int _{\mathscr {X}}k({\textbf {x}}, {\textbf {x}}^\prime )v_{t}({\textbf {x}}^\prime )p_{t}({\textbf {x}}^\prime ) \textrm{d}{} {\textbf {x}}^\prime \\&= \int _{\mathscr {Y}}k_{\varphi }({\textbf {y}}, {\textbf {y}}^\prime )\nabla \log q_{t}({\textbf {y}}^\prime )q_{t}({\textbf {y}}^\prime )\textrm{d}{} {\textbf {y}}^\prime -\int _{\mathscr {Y}}k_{\varphi }({\textbf {y}}, {\textbf {y}}^\prime )\nabla \log q^{*}({\textbf {y}}^\prime )q_{t}({\textbf {y}}^\prime )\textrm{d} {\textbf {y}}^\prime \\&= \int _{\mathscr {Y}}k_{\varphi }({\textbf {y}}, {\textbf {y}}^\prime )\nabla q_{t}({\textbf {y}}^\prime )\textrm{d}{} {\textbf {y}}^\prime -\int _{\mathscr {Y}}k_{\varphi }({\textbf {y}}, {\textbf {y}}^\prime )\nabla \log q^{*}({\textbf {y}}^\prime )q_{t}({\textbf {y}}^\prime )\textrm{d}{} {\textbf {y}}^\prime \\&= -\int _{\mathscr {Y}}\nabla k_{\varphi }({\textbf {y}}, {\textbf {y}}^\prime ) q_{t}({\textbf {y}}^\prime ) \textrm{d}{} {\textbf {y}}^\prime -\int _{\mathscr {Y}}k_{\varphi }({\textbf {y}}, {\textbf {y}}^\prime )\nabla \log q^{*}({\textbf {y}}^\prime )q_{t}({\textbf {y}}^\prime ) \textrm{d}{} {\textbf {y}}^\prime \\&= -\,\mathbb {E}_{{\textbf {y}}^\prime \sim q_{t}}\left[ k_{\varphi }({\textbf {y}}, {\textbf {y}}^\prime )\nabla \log q^{*}({\textbf {y}}^\prime )+ \nabla k_{\varphi }({\textbf {y}}, {\textbf {y}}^\prime )\right] \end{aligned} \end{aligned}$$

(31)

The first equality follows from the definition of the integral operator (see Definition 1), the second from (30) together with the change of variables ${\textbf {y}}^\prime =\nabla \varphi ({\textbf {x}}^\prime )$, the third from the identity $q_{t}\nabla \log q_{t}=\nabla q_{t}$, and the fourth from integration by parts applied to the first term. The proof is completed. $\square$
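The last line of (31) is an expectation over $q_{t}$, so it can be approximated by an average over particles in the dual space. The following is a minimal sketch of that particle estimate, not the authors' implementation. It assumes an RBF kernel for $k_{\varphi }$, takes the gradient of the kernel with respect to its second argument (as produced by the integration by parts), and uses a standard Gaussian as a stand-in target $q^{*}$; the function names, bandwidth and step size are all hypothetical.

```python
import numpy as np

def rbf_kernel_and_grad(y, Y, bandwidth=1.0):
    """Return k(y, y_j) and its gradient w.r.t. y_j for every row y_j of Y.

    Assumption: RBF kernel k(y, y') = exp(-||y - y'||^2 / (2 * bandwidth^2)).
    """
    diff = y - Y                                   # (N, d), rows are y - y_j
    sq_dist = np.sum(diff ** 2, axis=1)            # (N,)
    k = np.exp(-sq_dist / (2.0 * bandwidth ** 2))  # (N,)
    grad_k = (diff / bandwidth ** 2) * k[:, None]  # gradient w.r.t. y_j
    return k, grad_k

def smoothed_direction(y, Y, score_target, bandwidth=1.0):
    """Particle estimate of -E_{y'~q_t}[k(y, y') grad log q*(y') + grad k(y, y')]."""
    k, grad_k = rbf_kernel_and_grad(y, Y, bandwidth)
    scores = score_target(Y)                       # (N, d), grad log q*(y_j)
    return -np.mean(k[:, None] * scores + grad_k, axis=0)

def dual_step(Y, score_target, step=0.1, bandwidth=1.0):
    """One descent step on all dual particles along the smoothed direction."""
    D = np.stack([smoothed_direction(y, Y, score_target, bandwidth) for y in Y])
    return Y - step * D

# Stand-in target: standard Gaussian, whose score is grad log q*(y) = -y.
gaussian_score = lambda Y: -Y

Y0 = np.linspace(2.0, 4.0, 20).reshape(-1, 1)      # particles to the right of the mode
Y1 = dual_step(Y0, gaussian_score)
```

In this toy setting the kernel-gradient terms cancel when averaged over all particles, so a single step moves the particle mean toward the mode of the target while the repulsive term keeps the particles spread out.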

1.2 Appendix 1.2: Proof of Theorem 3

Proof

We analyze the performance of one step of mirrorVT. Under Assumption 1.1 ($L_{2}$-smoothness of G), for any $t\ge 0$, we have:

$$\begin{aligned} \begin{aligned} G(q_{t+1})&\le G(q_{t}) + \langle \texttt {grad}G(q_{t}), \texttt {Exp}_{q_{t}}^{-1}(q_{t+1})\rangle _{q_{t}} +1/2 L_{2}\cdot \mathscr {W}^{2}_{2}(q_{t+1},q_{t})\\&= G(q_{t}) - \eta _{t} \langle \texttt {grad}G(q_{t}), \texttt {grad}G(q_{t})+\tilde{\delta }_{t} \rangle _{q_{t}}\\&\quad +1/2 L_{2}\eta _{t}^{2}\langle \texttt {grad}G(q_{t})+\tilde{\delta }_{t} , \texttt {grad}G(q_{t})+\tilde{\delta }_{t} \rangle _{q_{t}}\\&= G(q_{t}) - \eta _{t} \langle \texttt {grad}G(q_{t}), \texttt {grad}G(q_{t}) \rangle _{q_{t}} - \eta _{t} \langle \texttt {grad}G(q_{t}), \tilde{\delta }_{t}\rangle _{q_{t}}\\&\quad + 1/2 L_{2}\eta _{t}^{2}\langle \texttt {grad}G(q_{t})+\tilde{\delta }_{t} , \texttt {grad}G(q_{t})+\tilde{\delta }_{t} \rangle _{q_{t}} \end{aligned} \end{aligned}$$

(32)

where $\eta _{t}\in \left( 0, \alpha /h\right]$ (see (11)) and $\tilde{\delta }_{t}=-\texttt {div}(q_{t}(\nabla \tilde{g_{t}^{*}}-\nabla g^{*}_{t}))$ is the difference between the true 2-Wasserstein gradient at $q_{t}$ given by $\texttt {grad}G(q_{t})=-\texttt {div}(q_{t}\nabla g^{*}_{t})$ and its estimate given by $-\texttt {div}(q_{t}\nabla \tilde{g_{t}^{*}})$. The corresponding expected gradient error for G is defined as:

$$\begin{aligned} \tilde{\epsilon }_{t}=\mathbb {E}\langle \tilde{\delta }_{t}, \tilde{\delta }_{t} \rangle _{q_{t}}=\mathbb {E}\int \Vert \nabla ^{2}\varphi ({\textbf {x}})^{-1}\left( \nabla \tilde{f_{t}^{*}}({\textbf {x}})-\nabla f^{*}_{t}({\textbf {x}})\right) \Vert ^{2}_{2}p_{t}({\textbf {x}})\textrm{d} {\textbf {x}} \end{aligned}$$

(33)

Also since $0 \prec \alpha {\textbf {I}}\preceq \nabla ^{2}\varphi ({\textbf {x}})$ for all ${\textbf {x}}\in \mathscr {X}$, we have

$$\begin{aligned} \tilde{\epsilon }_{t} \le \frac{1}{\alpha ^{2}}\epsilon _{t} \end{aligned}$$

(34)

By applying the basic inequality: $\langle \texttt {grad}G(q_{t}), \tilde{\delta }_{t}\rangle \le \frac{1}{2}\langle \texttt {grad}G(q_{t}), \texttt {grad}G(q_{t})\rangle + \frac{1}{2} \langle \tilde{\delta }_{t}, \tilde{\delta }_{t}\rangle$ and combining with (34), we have:

$$\begin{aligned} \begin{aligned} G(q_{t+1})&\le G(q_{t}) - 1/2 \cdot \eta _{t} (1 - 2\eta _{t} L_{2})\cdot \langle \texttt {grad}G(q_{t}), \texttt {grad}G(q_{t})\rangle _{q_{t}} + \frac{\eta _{t} (1 + 2\eta _{t} L_{2})}{2\alpha ^{2}} \epsilon _{t}\\ \end{aligned} \end{aligned}$$

(35)
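For completeness, the step from (32) to (35) can be sketched as follows. Write $g=\texttt {grad}G(q_{t})$ and $\delta =\tilde{\delta }_{t}$ (shorthand introduced here only), and use in addition the elementary bound $\langle g+\delta , g+\delta \rangle \le 2\langle g,g\rangle +2\langle \delta ,\delta \rangle$, which is not stated explicitly above but follows from the same basic inequality:

$$\begin{aligned} -\eta _{t}\langle g,\delta \rangle _{q_{t}}&\le \frac{\eta _{t}}{2}\langle g,g\rangle _{q_{t}}+\frac{\eta _{t}}{2}\langle \delta ,\delta \rangle _{q_{t}},\\ \frac{1}{2}L_{2}\eta _{t}^{2}\langle g+\delta ,g+\delta \rangle _{q_{t}}&\le L_{2}\eta _{t}^{2}\langle g,g\rangle _{q_{t}}+L_{2}\eta _{t}^{2}\langle \delta ,\delta \rangle _{q_{t}},\\ G(q_{t+1})&\le G(q_{t})-\frac{\eta _{t}}{2}(1-2\eta _{t}L_{2})\langle g,g\rangle _{q_{t}}+\frac{\eta _{t}}{2}(1+2\eta _{t}L_{2})\langle \delta ,\delta \rangle _{q_{t}} \end{aligned}$$

Taking expectations and using $\mathbb {E}\langle \delta ,\delta \rangle _{q_{t}}=\tilde{\epsilon }_{t}\le \epsilon _{t}/\alpha ^{2}$ from (33) and (34) then recovers the error term in (35).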

By the definition of the inner product on the tangent space and the assumption of $\mu$-strong convexity of F, we obtain the following inequality:

$$\begin{aligned} \begin{aligned} \langle \texttt {grad}G(q_{t}), \texttt {grad}G(q_{t})\rangle _{q_{t}}&=\int _{\mathscr {Y}} \Vert \nabla g^{*}_{t}({\textbf {y}})\Vert _{2}^{2}q_{t}({\textbf {y}})\textrm{d}{} {\textbf {y}}\\&=\int _{\mathscr {X}} \Vert \nabla ^{2}\varphi ({\textbf {x}})^{-1} \nabla f^{*}_{t}({\textbf {x}})\Vert _{2}^{2}p_{t}({\textbf {x}})\textrm{d} {\textbf {x}}\\&\ge \frac{1}{\beta ^{2}}\int _{\mathscr {X}} \Vert \nabla f^{*}_{t}({\textbf {x}})\Vert _{2}^{2}p_{t}({\textbf {x}})\textrm{d}{} {\textbf {x}}=\frac{1}{\beta ^{2}} \langle \texttt {grad}F(p_{t}), \texttt {grad}F(p_{t})\rangle _{p_{t}}\\&\ge \frac{\mu }{\beta ^{2}} \left( F(p_{t})-F(p^{*})\right) \end{aligned} \end{aligned}$$

(36)

where the first inequality is obtained by $\nabla ^{2}\varphi ({\textbf {x}})\preceq \beta {\textbf {I}}$ for all ${\textbf {x}}\in \mathscr {X}$ (so that $\nabla ^{2}\varphi ({\textbf {x}})^{-1}\succeq \frac{1}{\beta }{\textbf {I}}$) and the second inequality is obtained by Assumption 1.3 (see (26)). Thus, combining (35) with (36) and using the identity $F(p_{t})=G(q_{t})$, we have:

$$\begin{aligned} F(p_{t+1}) - F^{*} \le \left[ 1 - \frac{\mu \eta _{t}}{2\beta ^{2}} (1 - 2\eta _{t}L_{2})\right] \left( F(p_{t})-F^{*}\right) + \frac{\eta _{t} (1 + 2\eta _{t} L_{2})}{2\alpha ^{2}} \epsilon _{t} \end{aligned}$$

(37)

By setting $\eta _{t} =\eta \le \min \left\{ \frac{1}{2L_{2}},\frac{1}{\mu \beta ^{2}}\right\}$, we have:

$$\begin{aligned} 1 - \frac{\mu \eta _{t}}{2\beta ^{2}} \left( 1 - 2\eta _{t}L_{2}\right) \le 1-\frac{\mu \eta _{t}}{2\beta ^{2}}, \quad 0 \le 1-\frac{\mu \eta }{2\beta ^{2}} \quad \text { and } \quad \frac{\eta (1 + 2\eta L_{2})}{2\alpha ^{2}} \le \frac{\eta }{\alpha ^{2}} \end{aligned}$$

(38)

Defining $\rho = 1-\frac{\mu \eta }{2\beta ^{2}}\in \left[ 0,1 \right]$, we have:

$$\begin{aligned} F(p_{t+1}) - F^{*} \le \rho \left( F(p_{t})-F^{*}\right) + \frac{\eta _{t}}{\alpha ^{2}} \epsilon _{t} \end{aligned}$$

(39)

By forming a telescoping sequence and combining it with the upper bound on $\epsilon _{t}$ given in Liu et al. (2021), we have:

$$\begin{aligned} F(p_{t+1}) - \inf _{p\in \mathscr {P}_{2}(\mathscr {X})} F(p) \le \rho ^{t}\left( F(p_{1})-\inf _{p\in \mathscr {P}_{2}(\mathscr {X})} F(p)\right) + \frac{1-\rho ^{t}}{1-\rho }\frac{\eta }{\alpha ^{2}}\cdot \texttt {Error} \end{aligned}$$

(40)

Finally, by taking the expectation over the initial particle set, we complete the proof. $\square$
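The telescoping step behind (40) can be checked numerically. The sketch below reads (39) as a one-step contraction of the optimality gap $a_{t}=F(p_{t})-\inf F$, that is $a_{t+1}\le \rho a_{t}+c$ with $c=\frac{\eta }{\alpha ^{2}}\cdot \texttt {Error}$, and compares the iterated recursion (taken with equality) against the closed form on the right-hand side of (40). The numerical values of $\rho$, $c$ and the initial gap are hypothetical and chosen only for illustration.

```python
def iterate_recursion(a1, rho, c, t):
    """Apply a <- rho * a + c for t steps, starting from the initial gap a1."""
    a = a1
    for _ in range(t):
        a = rho * a + c
    return a

def unrolled_bound(a1, rho, c, t):
    """Closed form rho^t * a1 + c * (1 - rho^t) / (1 - rho), valid for rho != 1."""
    return rho ** t * a1 + c * (1.0 - rho ** t) / (1.0 - rho)

# Hypothetical values: initial gap 1.0, contraction factor 0.9, error floor 0.01.
a1, rho, c, t = 1.0, 0.9, 0.01, 50
iterated = iterate_recursion(a1, rho, c, t)
closed_form = unrolled_bound(a1, rho, c, t)
```

The two quantities agree up to floating-point error, and as $t$ grows both approach $c/(1-\rho )$, which is the floor set by the gradient estimation error rather than zero.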

1.3 Appendix 1.3: Details of the mirror maps

In this section, we describe more details of the mirror maps used in our simulated experiments.

1.3.1 Appendix 1.3.1: Mirror map on the unit ball

For the mirror map defined in (28), one can easily show that:

$$\begin{aligned} \frac{\partial \varphi }{\partial {\textbf {x}}_{i}}=\frac{{\textbf {x}}_{i}}{1-\Vert {\textbf {x}} \Vert _{2}}, \frac{\partial \varphi ^{*}}{\partial {\textbf {y}}_{i}}=\frac{{\textbf {y}}_{i}}{1+\Vert {\textbf {y}} \Vert _{2}}, \frac{\partial ^{2}\varphi }{\partial {\textbf {x}}_{i}\partial {\textbf {x}}_{j}}=\frac{\delta _{ij}}{1-\Vert {\textbf {x}} \Vert _{2}} + \frac{{\textbf {x}}_{i}{} {\textbf {x}}_{j}}{\Vert {\textbf {x}} \Vert _{2} \left( 1 - \Vert {\textbf {x}} \Vert _{2} \right) ^{2}} \end{aligned}$$

Hence the Hessian matrix can be written as: $\nabla ^{2}\varphi ({\textbf {x}})=\frac{1}{1-\Vert {\textbf {x}} \Vert _{2}}{} {\textbf {I}}+\frac{1}{\Vert {\textbf {x}} \Vert _{2} \left( 1- \Vert {\textbf {x}} \Vert _{2}\right) ^{2}}{} {\textbf {x}} {\textbf {x}}^\top$, where ${\textbf {I}}$ is the identity matrix. To obtain the inverse of the Hessian matrix, we apply the Woodbury matrix identity and show that

$$\begin{aligned} \left( \nabla ^{2}\varphi ({\textbf {x}})\right) ^{-1}=\left( 1 - \Vert {\textbf {x}} \Vert _{2}\right) \left( {\textbf {I}}-\frac{1}{\Vert {\textbf {x}} \Vert _{2}}{} {\textbf {x}} {\textbf {x}}^\top \right) \end{aligned}$$
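The closed forms above can be sanity-checked numerically. The sketch below implements the gradient maps and the Hessian exactly as written in this subsection and verifies that the stated inverse Hessian is indeed an inverse and that the two gradient maps undo each other. The function names and the test point are hypothetical, and the formulas require $0<\Vert {\textbf {x}}\Vert _{2}<1$.

```python
import numpy as np

def mirror_ball(x):
    """Gradient of the mirror map: y = x / (1 - ||x||)."""
    return x / (1.0 - np.linalg.norm(x))

def inv_mirror_ball(y):
    """Gradient of the conjugate: x = y / (1 + ||y||)."""
    return y / (1.0 + np.linalg.norm(y))

def hessian_ball(x):
    """Hessian: I / (1 - r) + x x^T / (r (1 - r)^2), with r = ||x||."""
    r = np.linalg.norm(x)
    return np.eye(len(x)) / (1.0 - r) + np.outer(x, x) / (r * (1.0 - r) ** 2)

def inv_hessian_ball(x):
    """Stated inverse Hessian: (1 - r) (I - x x^T / r), with r = ||x||."""
    r = np.linalg.norm(x)
    return (1.0 - r) * (np.eye(len(x)) - np.outer(x, x) / r)

x = np.array([0.3, -0.2, 0.5])           # a point strictly inside the unit ball
product = hessian_ball(x) @ inv_hessian_ball(x)
round_trip = inv_mirror_ball(mirror_ball(x))
```

Here `product` should equal the identity matrix and `round_trip` should equal `x`, both up to floating-point error.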

1.3.2 Appendix 1.3.2: Mirror map on the simplex

For the entropic mirror map (see Beck & Teboulle 2003), we can consider ${\textbf {x}}=\left[ {\textbf {x}}_{1},\ldots ,{\textbf {x}}_{d-1}\right] \in \mathbb {R}^{d-1}$ by discarding the last entry ${\textbf {x}}_{d}=1 - \sum _{i=1}^{d-1}{} {\textbf {x}}_{i}$ and easily show that:

$$\begin{aligned} \frac{\partial \varphi }{\partial {\textbf {x}}_{i}}=\log {\textbf {x}}_{i} - \log {\textbf {x}}_{d}, \frac{\partial \varphi ^{*}}{\partial {\textbf {y}}_{i}}=\frac{\exp {\left( {\textbf {y}}_{i}\right) }}{1+\sum _{j=1}^{d-1}\exp {\left( {\textbf {y}}_{j}\right) }}, \frac{\partial ^{2}\varphi }{\partial {\textbf {x}}_{i}\partial {\textbf {x}}_{j}}=\frac{\delta _{ij}}{{\textbf {x}}_{i}} + \frac{1}{{\textbf {x}}_{d}}, \forall i,j \in \left[ d-1\right] \end{aligned}$$

Hence the Hessian matrix can be written as: $\nabla ^{2}\varphi ({\textbf {x}})=\texttt {diag}(1/{\textbf {x}}_{1},1/{\textbf {x}}_{2},\ldots ,1/{\textbf {x}}_{d-1})+1/{\textbf {x}}_{d} {\textbf {1}}{} {\textbf {1}}^\top$. By applying the Sherman-Morrison formula, we obtain the inverse Hessian matrix of the following form:

$$\begin{aligned} \left( \nabla ^{2}\varphi ({\textbf {x}})\right) ^{-1}=\texttt {diag}({\textbf {x}}) - {\textbf {x}} {\textbf {x}}^\top \end{aligned}$$
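As a numerical check of the simplex formulas, the sketch below works in the reduced coordinates ${\textbf {x}}\in \mathbb {R}^{d-1}$ with ${\textbf {x}}_{d}=1-\sum _{i}{\textbf {x}}_{i}$. The inverse map is obtained by solving ${\textbf {y}}_{i}=\log {\textbf {x}}_{i}-\log {\textbf {x}}_{d}$ under that constraint, which gives the normalizer $1+\sum _{j}\exp ({\textbf {y}}_{j})$. The function names and the test point are hypothetical.

```python
import numpy as np

def mirror_simplex(x):
    """Gradient of the entropic mirror map in reduced coordinates."""
    x_d = 1.0 - x.sum()
    return np.log(x) - np.log(x_d)

def inv_mirror_simplex(y):
    """Inverse map: x_i = exp(y_i) / (1 + sum_j exp(y_j))."""
    e = np.exp(y)
    return e / (1.0 + e.sum())

def hessian_simplex(x):
    """Hessian: diag(1 / x) + (1 / x_d) * 1 1^T."""
    x_d = 1.0 - x.sum()
    n = len(x)
    return np.diag(1.0 / x) + np.ones((n, n)) / x_d

def inv_hessian_simplex(x):
    """Inverse Hessian from the Sherman-Morrison formula: diag(x) - x x^T."""
    return np.diag(x) - np.outer(x, x)

x = np.array([0.2, 0.3, 0.1])            # first d-1 entries; x_d = 0.4
product = hessian_simplex(x) @ inv_hessian_simplex(x)
round_trip = inv_mirror_simplex(mirror_simplex(x))
```

Here `product` should equal the identity matrix and `round_trip` should recover `x`. The inverse Hessian needs no matrix inversion or division, which keeps the particle update cheap even near the boundary of the simplex.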

1.4 Appendix 1.4: Details of network architectures

We present the neural network architectures of DirVAE on the image data set MNIST in Table 4 and on the text data set in Table 5.

Table 4 Network architecture of DirVAE for MNIST

Table 5 Network Architecture of DirVAE for IMDB

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Nguyen, D.H., Sakurai, T. Mirror variational transport: a particle-based algorithm for distributional optimization on constrained domains. Mach Learn 112, 2845–2869 (2023). https://doi.org/10.1007/s10994-023-06350-9

Received: 16 November 2022
Revised: 09 March 2023
Accepted: 12 May 2023
Published: 27 June 2023
Issue Date: August 2023
DOI: https://doi.org/10.1007/s10994-023-06350-9
