Mirror variational transport: a particle-based algorithm for distributional optimization on constrained domains

Abstract

We consider the problem of minimizing an objective functional that admits a variational form and is defined over probability distributions on a constrained domain; the constraint poses challenges to both theoretical analysis and algorithmic design. We propose Mirror Variational Transport (mirrorVT), which uses a set of samples, or particles, to represent the approximating distribution and deterministically updates the particles to optimize the functional. To deal with the constrained domain, in each iteration mirrorVT maps the particles to an unconstrained dual domain induced by a mirror map and then approximately performs Wasserstein gradient descent on the manifold of distributions defined over the dual space, updating each particle along a specified direction. At the end of each iteration, the particles are mapped back to the original constrained domain. Through experiments on synthetic and real-world data sets, we demonstrate the effectiveness of mirrorVT for distributional optimization on constrained domains. We also analyze its theoretical properties and characterize its convergence to the global minimum of the objective functional.
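
To make the overall procedure concrete, the iteration described above can be sketched as follows. This is a minimal illustration only, assuming a generic step size and a user-supplied routine for estimating the update direction in the dual space; the function names are illustrative and not those of the authors' released implementation.

import numpy as np

def mirror_vt_step(particles, nabla_phi, nabla_phi_star, update_direction, step_size=0.1):
    """One mirrorVT-style iteration on a set of particles (illustrative sketch)."""
    # Map each particle to the unconstrained dual domain via the mirror map.
    dual_particles = np.array([nabla_phi(x) for x in particles])
    # Estimate an update direction for every dual particle from the whole set
    # (e.g., a kernel-based approximation of the Wasserstein gradient).
    directions = update_direction(dual_particles)
    # Approximate Wasserstein gradient descent step in the dual space.
    dual_particles = dual_particles - step_size * directions
    # Map the particles back to the original constrained domain.
    return np.array([nabla_phi_star(y) for y in dual_particles])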

Code availability

The simulated data and source code for the experiments can be accessed at https://github.com/haidnguyen0909/mirrorVT upon acceptance of the paper.

Notes

  1. The 1-Wasserstein distance is defined as: \(\mathscr {W}_{1}(p, q)= \inf _{\pi \in \Pi (p,q)} \int _{\mathscr {X}\times \mathscr {X}} \Vert {\textbf {x}} - {\textbf {x}}^\prime \Vert _{2}\textrm{d}\pi ({\textbf {x}},{\textbf {x}}^\prime )\)

  2. We use [N] to indicate the list \(\left[ 1,2,\ldots ,N\right]\) throughout the rest of the paper.

  3. For any \({\textbf {x}}{}, {\textbf {x}}^\prime \in \mathscr {X}\) and \({\textbf {y}}=\nabla \varphi ({\textbf {x}}),{\textbf {y}}^\prime =\nabla \varphi ({\textbf {x}}^\prime )\), we have: \(\Vert \nabla g^{*}_{t}({\textbf {y}})-\nabla g^{*}_{t}({\textbf {y}}^\prime ) \Vert _{2} = \Vert \nabla ^{2} \varphi ({\textbf {x}})^{-1}\nabla f^{*}_{t}({\textbf {x}})-\nabla ^{2} \varphi ({\textbf {x}}^\prime )^{-1}\nabla f^{*}_{t}({\textbf {x}}^\prime )\Vert _{2} \le h \Vert {\textbf {x}}-{\textbf {x}}^{\prime } \Vert _{2} = h \Vert \nabla \varphi ^{*}({\textbf {y}})-\nabla \varphi ^{*}({\textbf {y}}^\prime ) \Vert _{2} \le h/\alpha \Vert {\textbf {y}}-{\textbf {y}}^\prime \Vert _{2}\), where the last inequality holds as \(\varphi ^{*}\) is \(1/\alpha\)-smooth.

References

  • Ahn, K., & Chewi, S. (2021). Efficient constrained sampling via the mirror-Langevin algorithm. Advances in Neural Information Processing Systems, 34, 28405–28418.

  • Arjovsky, M., Chintala, S., & Bottou, L. (2017). Wasserstein generative adversarial networks. In International conference on machine learning (pp. 214–223). PMLR.

  • Beck, A., & Teboulle, M. (2003). Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3), 167–175.

  • Bowman, S. R., Vilnis, L., Vinyals, O., Dai, A. M., Jozefowicz, R., & Bengio, S. (2015). Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349

  • Cheng, X., & Bartlett, P. (2018). Convergence of Langevin MCMC in KL-divergence. In Algorithmic learning theory (pp. 186–211). PMLR.

  • Cuturi, M. (2013). Sinkhorn distances: Lightspeed computation of optimal transport. Advances in Neural Information Processing Systems, 26.

  • Duchi, J., Shalev-Shwartz, S., Singer, Y., & Chandra, T. (2008). Efficient projections onto the \(\ell _{1}\)-ball for learning in high dimensions. In Proceedings of the 25th international conference on machine learning (pp. 272–279).

  • Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., & Smola, A. (2012). A kernel two-sample test. The Journal of Machine Learning Research, 13(1), 723–773.

  • Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.

  • Hsieh, Y.-P., Kavis, A., Rolland, P., & Cevher, V. (2018). Mirrored Langevin dynamics. Advances in Neural Information Processing Systems, 31.

  • Joo, W., Lee, W., Park, S., & Moon, I.-C. (2020). Dirichlet variational autoencoder. Pattern Recognition, 107, 107514.

  • Kingma, D. P., & Welling, M. (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114

  • Koziel, S., & Michalewicz, Z. (1998). A decoder-based evolutionary algorithm for constrained parameter optimization problems. In Parallel problem solving from nature-PPSN V: 5th International conference Amsterdam, 1998 Proceedings (Vol. 5, pp. 231–240). Springer.

  • Liu, L., Zhang, Y., Yang, Z., Babanezhad, R., & Wang, Z. (2021). Infinite-dimensional optimization for zero-sum games via variational transport. In International conference on machine learning (pp. 7033–7044). PMLR.

  • Liu, Q., & Wang, D. (2016). Stein variational gradient descent: A general purpose Bayesian inference algorithm. Advances in Neural Information Processing Systems, 29.

  • Ma, Y.-A., Chen, T., & Fox, E. (2015). A complete recipe for stochastic gradient MCMC. Advances in Neural Information Processing Systems, 28.

  • Michalewicz, Z., & Schoenauer, M. (1996). Evolutionary algorithms for constrained parameter optimization problems. Evolutionary Computation, 4(1), 1–32.

  • Nguyen, D. H., Nguyen, C. H., & Mamitsuka, H. (2021). Learning subtree pattern importance for Weisfeiler–Lehman based graph kernels. Machine Learning, 110, 1585–1607.

  • Nguyen, D. H., & Tsuda, K. (2023). On a linear fused Gromov–Wasserstein distance for graph structured data. Pattern Recognition (p. 109351).

  • Rosasco, L., Belkin, M., & De Vito, E. (2009). A note on learning with integral operators. In COLT. Citeseer.

  • Santambrogio, F. (2017). Euclidean, metric, and Wasserstein gradient flows: An overview. Bulletin of Mathematical Sciences, 7(1), 87–154.

  • Shi, J., Liu, C., & Mackey, L. (2021). Sampling with mirrored Stein operators. arXiv preprint arXiv:2106.12506

  • Villani, C. (2009). Optimal transport: Old and new (Vol. 338). Springer.

  • Welling, M., & Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th international conference on machine learning (ICML-11) (pp. 681–688).

  • Wibisono, A. (2018). Sampling as optimization in the space of measures: The Langevin dynamics as a composite optimization problem. In Conference on learning theory (pp. 2093–3027). PMLR.

  • Xu, P., Chen, J., Zou, D., & Gu, Q. (2018). Global convergence of Langevin dynamics based algorithms for nonconvex optimization. Advances in Neural Information Processing Systems, 31.

  • Zhang, H., & Sra, S. (2016). First-order methods for geodesically convex optimization. In Conference on learning theory (pp. 1617–1638). PMLR.

Funding

D. H. N. was supported by the Japan Society for the Promotion of Science (JSPS) KAKENHI Grant Number 23K16939. T. S. was supported by the New Energy and Industrial Technology Development Organization (NEDO) Grant Number JPNP18010 and Japan Science and Technology Agency (JST) Grant Number JPMJPF2017.

Author information

Corresponding author

Correspondence to Dai Hai Nguyen.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Editors: Fabio Vitale, Tania Cerquitelli, Marcello Restelli, and Charalampos Tsourakakis.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix 1

1.1 Appendix 1.1: Proof of Theorem 2

Proof

We have assumed \(F(p)=\text {KL}(p||p^{*})\) for \(p,p^{*}\in \mathscr {P}_{2}(\mathscr {X})\), so \(G(q)=\text {KL}(q||q^{*})\), for \(q,q^{*}\in \mathscr {P}_{2}(\mathscr {Y})\). By the definition of the first variation of a functional, we have:

$$\begin{aligned} \frac{\textrm{d}}{\textrm{d}\epsilon } G(q + \epsilon \chi )\bigg |_{\epsilon =0}=\int _{\mathscr {Y}}\frac{\partial G}{\partial q}({\textbf {y}})\chi ({\textbf {y}})\textrm{d}{\textbf {y}} \text {, for all perturbations } \chi \text { with } \int _{\mathscr {Y}}\chi ({\textbf {y}})\textrm{d}{\textbf {y}}=0 \end{aligned}$$

We can compute the left-hand side as follows:

$$\begin{aligned} \frac{\textrm{d}}{\textrm{d}\epsilon } G(q + \epsilon \chi )\bigg |_{\epsilon =0}&=\frac{\textrm{d}}{\textrm{d}\epsilon } \text {KL}(q + \epsilon \chi || q^{*})\bigg |_{\epsilon =0}\\&= \frac{\textrm{d}}{\textrm{d}\epsilon }\int (q + \epsilon \chi )\log \left( \frac{q+\epsilon \chi }{q^{*}} \right) \textrm{d}{} {\textbf {y}}\bigg |_{\epsilon =0}\\&= \int \log \frac{q}{q^{*}}({\textbf {y}})\chi ({\textbf {y}})\textrm{d}{} {\textbf {y}} \end{aligned}$$

where the additive constant arising from the differentiation vanishes because \(\chi\) integrates to zero; this indicates that \(\partial G/\partial q= \log q - \log q^{*}\). For the t-th iteration, the update direction \(v_{t}\) is given by:

$$\begin{aligned} \begin{aligned} v_{t}({\textbf {x}})&= \nabla ^{2}\varphi ({\textbf {x}})^{-1}\nabla f^{*}_{t}({\textbf {x}})=\nabla g^{*}_{t}({\textbf {y}})\\&= \nabla \log q_{t}({\textbf {y}}) - \nabla \log q^{*}({\textbf {y}}) \end{aligned} \end{aligned}$$
(30)

for all \({\textbf {x}}\in \mathscr {X}, {\textbf {y}}=\nabla \varphi ({\textbf {x}})\in \mathscr {Y}\). By applying the integral operator \(\mathscr {L}_{k, p_{t}}\) (see Definition 1) to \(v_{t}\), we obtain:

$$\begin{aligned} \begin{aligned} \mathscr {L}_{k, p_{t}} v_{t}({\textbf {x}})&= \int _{\mathscr {X}}k({\textbf {x}}, {\textbf {x}}^\prime )v_{t}({\textbf {x}}^\prime )p_{t}({\textbf {x}}^\prime ) \textrm{d}{\textbf {x}}^\prime \\&= \int _{\mathscr {Y}}k_{\varphi }({\textbf {y}}, {\textbf {y}}^\prime )\nabla \log q_{t}({\textbf {y}}^\prime )q_{t}({\textbf {y}}^\prime )\textrm{d}{\textbf {y}}^\prime -\int _{\mathscr {Y}}k_{\varphi }({\textbf {y}}, {\textbf {y}}^\prime )\nabla \log q^{*}({\textbf {y}}^\prime )q_{t}({\textbf {y}}^\prime )\textrm{d}{\textbf {y}}^\prime \\&= \int _{\mathscr {Y}}k_{\varphi }({\textbf {y}}, {\textbf {y}}^\prime )\nabla q_{t}({\textbf {y}}^\prime )\textrm{d}{\textbf {y}}^\prime -\int _{\mathscr {Y}}k_{\varphi }({\textbf {y}}, {\textbf {y}}^\prime )\nabla \log q^{*}({\textbf {y}}^\prime )q_{t}({\textbf {y}}^\prime )\textrm{d}{\textbf {y}}^\prime \\&= -\int _{\mathscr {Y}}\nabla k_{\varphi }({\textbf {y}}, {\textbf {y}}^\prime ) q_{t}({\textbf {y}}^\prime ) \textrm{d}{\textbf {y}}^\prime -\int _{\mathscr {Y}}k_{\varphi }({\textbf {y}}, {\textbf {y}}^\prime )\nabla \log q^{*}({\textbf {y}}^\prime )q_{t}({\textbf {y}}^\prime ) \textrm{d}{\textbf {y}}^\prime \\&= -\,\mathbb {E}_{{\textbf {y}}^\prime \sim q_{t}}\left[ k_{\varphi }({\textbf {y}}, {\textbf {y}}^\prime )\nabla \log q^{*}({\textbf {y}}^\prime )+ \nabla k_{\varphi }({\textbf {y}}, {\textbf {y}}^\prime )\right] \end{aligned} \end{aligned}$$
(31)

The first equality follows from the definition of the integral operator (see Definition 1), the second equality follows from (30) together with the change of variables \({\textbf {y}}^\prime =\nabla \varphi ({\textbf {x}}^\prime )\), the third equality uses \(q_{t}\nabla \log q_{t}=\nabla q_{t}\), and the fourth equality follows from applying integration by parts to the first term. The proof is completed. \(\square\)
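
In practice, the expectation in (31) is approximated with the current particle set. The following is a hedged sketch of this Monte Carlo estimate, assuming an RBF kernel in the dual space and access to the score \(\nabla \log q^{*}\); the kernel choice, bandwidth, and function names are illustrative and not taken from the paper.

import numpy as np

def estimate_direction(dual_particles, score_q_star, bandwidth=1.0):
    """Approximate (31) at every dual particle; returns an (N, d) array."""
    Y = dual_particles                               # particles y_j ~ q_t, shape (N, d)
    diffs = Y[:, None, :] - Y[None, :, :]            # y - y', shape (N, N, d)
    sq_dists = np.sum(diffs ** 2, axis=-1)           # squared pairwise distances
    K = np.exp(-sq_dists / (2.0 * bandwidth))        # k_phi(y, y'), shape (N, N)
    grad_K = K[:, :, None] * diffs / bandwidth       # gradient of k_phi(y, y') w.r.t. y'
    scores = score_q_star(Y)                         # grad log q*(y'), shape (N, d)
    # (31): -E_{y'~q_t}[ k_phi(y, y') grad log q*(y') + grad_{y'} k_phi(y, y') ]
    return -(K @ scores + grad_K.sum(axis=1)) / Y.shape[0]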

1.2 Appendix 1.2: Proof of Theorem 3

Proof

We analyze the performance of one step of mirrorVT. Under Assumption 1.1 (\(L_{2}\)-smoothness of G), for any \(t\ge 0\), we have:

$$\begin{aligned} \begin{aligned} G(q_{t+1})&\le G(q_{t}) + \langle \texttt {grad}G(q_{t}), \texttt {Exp}_{q_{t}}^{-1}(q_{t+1})\rangle _{q_{t}} +1/2 L_{2}\cdot \mathscr {W}^{2}_{2}(q_{t+1},q_{t})\\&= G(q_{t}) - \eta _{t} \langle \texttt {grad}G(q_{t}), \texttt {grad}G(q_{t})+\tilde{\delta }_{t} \rangle _{q_{t}}\\&\quad +1/2 L_{2}\eta _{t}^{2}\langle \texttt {grad}G(q_{t})+\tilde{\delta }_{t} , \texttt {grad}G(q_{t})+\tilde{\delta }_{t} \rangle _{q_{t}}\\&= G(q_{t}) - \eta _{t} \langle \texttt {grad}G(q_{t}), \texttt {grad}G(q_{t}) \rangle _{q_{t}} - \eta _{t} \langle \texttt {grad}G(q_{t}), \tilde{\delta }_{t}\rangle _{q_{t}}\\&\quad + 1/2 L_{2}\eta _{t}^{2}\langle \texttt {grad}G(q_{t})+\tilde{\delta }_{t} , \texttt {grad}G(q_{t})+\tilde{\delta }_{t} \rangle _{q_{t}} \end{aligned} \end{aligned}$$
(32)

where \(\eta _{t}\in \left( 0, \alpha /h\right]\) (see (11)) and \(\tilde{\delta }_{t}=-\texttt {div}(q_{t}(\nabla \tilde{g_{t}^{*}}-\nabla g^{*}_{t}))\) is the difference between the estimated 2-Wasserstein gradient at \(q_{t}\), given by \(-\texttt {div}(q_{t}\nabla \tilde{g_{t}^{*}})\), and the true gradient \(\texttt {grad}G(q_{t})=-\texttt {div}(q_{t}\nabla g^{*}_{t})\). The corresponding expected gradient error for G is defined as:

$$\begin{aligned} \tilde{\epsilon }_{t}=\mathbb {E}\langle \tilde{\delta }_{t}, \tilde{\delta }_{t} \rangle _{q_{t}}=\mathbb {E}\int \Vert \nabla ^{2}\varphi ({\textbf {x}})^{-1}\left( \nabla \tilde{f_{t}^{*}}({\textbf {x}})-\nabla f^{*}_{t}({\textbf {x}})\right) \Vert ^{2}_{2}p_{t}({\textbf {x}})\textrm{d} {\textbf {x}} \end{aligned}$$
(33)

Also since \(0 \prec \alpha {\textbf {I}}\preceq \nabla ^{2}\varphi ({\textbf {x}})\) for all \({\textbf {x}}\in \mathscr {X}\), we have

$$\begin{aligned} \tilde{\epsilon }_{t} \le \frac{1}{\alpha ^{2}}\epsilon _{t} \end{aligned}$$
(34)

By applying the basic inequality \(-\langle \texttt {grad}G(q_{t}), \tilde{\delta }_{t}\rangle \le \frac{1}{2}\langle \texttt {grad}G(q_{t}), \texttt {grad}G(q_{t})\rangle + \frac{1}{2} \langle \tilde{\delta }_{t}, \tilde{\delta }_{t}\rangle\) to the cross term, bounding the last term of (32) via \(\langle a+b, a+b\rangle \le 2\langle a, a\rangle + 2\langle b, b\rangle\), and combining with (34), we have:

$$\begin{aligned} \begin{aligned} G(q_{t+1})&\le G(q_{t}) - 1/2 \cdot \eta _{t} (1 - 2\eta _{t} L_{2})\cdot \langle \texttt {grad}G(q_{t}), \texttt {grad}G(q_{t})\rangle _{q_{t}} + \frac{\eta _{t} (1 + 2\eta _{t} L_{2})}{2\alpha ^{2}} \epsilon _{t}\\ \end{aligned} \end{aligned}$$
(35)

By the definition of the inner product on the tangent space and the assumption of \(\mu\)-strong convexity of F, we obtain the following inequality:

$$\begin{aligned} \begin{aligned} \langle \texttt {grad}G(q_{t}), \texttt {grad}G(q_{t})\rangle _{q_{t}}&=\int _{\mathscr {Y}} \Vert \nabla g^{*}_{t}({\textbf {y}})\Vert _{2}^{2}q_{t}({\textbf {y}})\textrm{d}{\textbf {y}}\\&=\int _{\mathscr {X}} \Vert \nabla ^{2}\varphi ({\textbf {x}})^{-1} \nabla f^{*}_{t}({\textbf {x}})\Vert _{2}^{2}p_{t}({\textbf {x}})\textrm{d}{\textbf {x}}\\&\ge \frac{1}{\beta ^{2}}\int _{\mathscr {X}} \Vert \nabla f^{*}_{t}({\textbf {x}})\Vert _{2}^{2}p_{t}({\textbf {x}})\textrm{d}{\textbf {x}}=\frac{1}{\beta ^{2}} \langle \texttt {grad}F(p_{t}), \texttt {grad}F(p_{t})\rangle _{p_{t}}\\&\ge \frac{\mu }{\beta ^{2}} \left( F(p_{t})-F(p^{*})\right) \end{aligned} \end{aligned}$$
(36)

where the first inequality is obtained from \(\nabla ^{2}\varphi ({\textbf {x}})\preceq \beta {\textbf {I}}\) for all \({\textbf {x}}\in \mathscr {X}\) and the second inequality is obtained from Assumption 1.3 (see (26)). Thus, combining with (35) and using the identity \(F(p_{t})=G(q_{t})\), we have:

$$\begin{aligned} F(p_{t+1}) - F^{*} \le \left[ 1 - \frac{\mu \eta _{t}}{2\beta ^{2}} \left( 1 - 2\eta _{t}L_{2}\right) \right] \left( F(p_{t})-F^{*}\right) + \frac{\eta _{t} (1 + 2\eta _{t} L_{2})}{2\alpha ^{2}} \epsilon _{t} \end{aligned}$$
(37)

By setting \(\eta _{t} =\eta \le \min \left\{ \frac{1}{2L_{2}},\frac{1}{\mu \beta ^{2}}\right\}\), we have:

$$\begin{aligned} 1 - \frac{\mu \eta _{t}}{2\beta ^{2}} \left( 1 - 2\eta _{t}L_{2}\right) \le 1-\frac{\mu \eta _{t}}{2\beta ^{2}}, \quad 0 \le 1-\frac{\mu \eta }{2\beta ^{2}} \quad \text { and } \quad \frac{\eta (1 + 2\eta L_{2})}{2\alpha ^{2}} \le \frac{\eta }{\alpha ^{2}} \end{aligned}$$
(38)

In the sequel, we define \(\rho = 1-\frac{\mu \eta }{2\beta ^{2}}\in \left[ 0,1 \right]\) and obtain:

$$\begin{aligned} F(p_{t+1}) - F^{*} \le \rho \left( F(p_{t})-F^{*}\right) + \frac{\eta }{\alpha ^{2}} \epsilon _{t} \end{aligned}$$
(39)

By forming a telescoping sum and combining it with the upper bound on \(\epsilon _{t}\) given in Liu et al. (2021), we have:

$$\begin{aligned} F(p_{t+1}) - \inf _{p\in \mathscr {P}_{2}(\mathscr {X})} F(p) \le \rho ^{t}\left( F(p_{1})-\inf _{p\in \mathscr {P}_{2}(\mathscr {X})} F(p)\right) + \frac{1-\rho ^{t}}{1-\rho }\frac{\eta }{\alpha ^{2}}\cdot \texttt {Error} \end{aligned}$$
(40)
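
For completeness, (40) follows by unrolling the recursion (39) for t steps and bounding each \(\epsilon _{s}\) by the uniform bound \(\texttt {Error}\) from Liu et al. (2021); here \(F^{*}\) denotes \(\inf _{p\in \mathscr {P}_{2}(\mathscr {X})} F(p)\):

$$\begin{aligned} F(p_{t+1}) - F^{*}&\le \rho \left( F(p_{t})-F^{*}\right) + \frac{\eta }{\alpha ^{2}}\epsilon _{t} \le \rho ^{2}\left( F(p_{t-1})-F^{*}\right) + \frac{\eta }{\alpha ^{2}}\left( \epsilon _{t}+\rho \epsilon _{t-1}\right) \le \cdots \\&\le \rho ^{t}\left( F(p_{1})-F^{*}\right) + \frac{\eta }{\alpha ^{2}}\sum _{s=1}^{t}\rho ^{t-s}\epsilon _{s} \le \rho ^{t}\left( F(p_{1})-F^{*}\right) + \frac{1-\rho ^{t}}{1-\rho }\frac{\eta }{\alpha ^{2}}\cdot \texttt {Error} \end{aligned}$$

where the last step uses the geometric sum \(\sum _{s=1}^{t}\rho ^{t-s}=(1-\rho ^{t})/(1-\rho )\).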

Finally, by taking the expectation over the initial particle set, we complete the proof. \(\square\)

1.3 Appendix 1.3: Details of the mirror maps

In this section, we provide further details of the mirror maps used in our simulated experiments.

1.3.1 Appendix 1.3.1: Mirror map on the unit ball

For the mirror map defined in (28), it can easily be shown that:

$$\begin{aligned} \frac{\partial \varphi }{\partial {\textbf {x}}_{i}}=\frac{{\textbf {x}}_{i}}{1-\Vert {\textbf {x}} \Vert _{2}}, \frac{\partial \varphi ^{*}}{\partial {\textbf {y}}_{i}}=\frac{{\textbf {y}}_{i}}{1+\Vert {\textbf {y}} \Vert _{2}}, \frac{\partial ^{2}\varphi }{\partial {\textbf {x}}_{i}\partial {\textbf {x}}_{j}}=\frac{\delta _{ij}}{1-\Vert {\textbf {x}} \Vert _{2}} + \frac{{\textbf {x}}_{i}{} {\textbf {x}}_{j}}{\Vert {\textbf {x}} \Vert _{2} \left( 1 - \Vert {\textbf {x}} \Vert _{2} \right) ^{2}} \end{aligned}$$

Hence the Hessian matrix can be written as: \(\nabla ^{2}\varphi ({\textbf {x}})=\frac{1}{1-\Vert {\textbf {x}} \Vert _{2}}{\textbf {I}}+\frac{1}{\Vert {\textbf {x}} \Vert _{2} \left( 1- \Vert {\textbf {x}} \Vert _{2}\right) ^{2}}{\textbf {x}} {\textbf {x}}^\top\), where \({\textbf {I}}\) is the identity matrix. To obtain the inverse of the Hessian matrix, we apply the Woodbury matrix identity and show that

$$\begin{aligned} \left( \nabla ^{2}\varphi ({\textbf {x}})\right) ^{-1}=\left( 1 - \Vert {\textbf {x}} \Vert _{2}\right) \left( {\textbf {I}}-\frac{1}{\Vert {\textbf {x}} \Vert _{2}}{} {\textbf {x}} {\textbf {x}}^\top \right) \end{aligned}$$
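
The three quantities above are all that mirrorVT needs for the unit ball. A hedged sketch follows, implementing only the displayed formulas; the helper names are illustrative and assume a nonzero input.

import numpy as np

def ball_nabla_phi(x):
    """Forward mirror map: grad(phi)(x) = x / (1 - ||x||_2)."""
    return x / (1.0 - np.linalg.norm(x))

def ball_nabla_phi_star(y):
    """Inverse mirror map: grad(phi*)(y) = y / (1 + ||y||_2)."""
    return y / (1.0 + np.linalg.norm(y))

def ball_inv_hessian(x):
    """Inverse Hessian: (1 - ||x||_2) (I - x x^T / ||x||_2), assuming x != 0."""
    norm_x = np.linalg.norm(x)
    return (1.0 - norm_x) * (np.eye(x.shape[0]) - np.outer(x, x) / norm_x)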

1.3.2 Appendix 1.3.2: Mirror map on the simplex

For the entropic mirror map (see Beck & Teboulle, 2003), we consider \({\textbf {x}}=\left[ {\textbf {x}}_{1},\ldots ,{\textbf {x}}_{d-1}\right] \in \mathbb {R}^{d-1}\), treating the last entry as \({\textbf {x}}_{d}=1 - \sum _{i=1}^{d-1}{\textbf {x}}_{i}\), and can easily show that:

$$\begin{aligned} \frac{\partial \varphi }{\partial {\textbf {x}}_{i}}=\log {\textbf {x}}_{i} - \log {\textbf {x}}_{d}, \quad \frac{\partial \varphi ^{*}}{\partial {\textbf {y}}_{i}}=\frac{\exp {\left( {\textbf {y}}_{i}\right) }}{1+\sum _{j=1}^{d-1}\exp {\left( {\textbf {y}}_{j}\right) }}, \quad \frac{\partial ^{2}\varphi }{\partial {\textbf {x}}_{i}\partial {\textbf {x}}_{j}}=\frac{\delta _{ij}}{{\textbf {x}}_{i}} + \frac{1}{{\textbf {x}}_{d}}, \quad \forall i \in \left[ d-1\right] \end{aligned}$$

Hence the Hessian matrix can be written as: \(\nabla ^{2}\varphi ({\textbf {x}})=\texttt {diag}(1/{\textbf {x}}_{1},1/{\textbf {x}}_{2},\ldots ,1/{\textbf {x}}_{d-1})+1/{\textbf {x}}_{d} {\textbf {1}}{} {\textbf {1}}^\top\). By applying the Sherman-Morrison formula, we obtain the inverse Hessian matrix of the following form:

$$\begin{aligned} \left( \nabla ^{2}\varphi ({\textbf {x}})\right) ^{-1}=\texttt {diag}({\textbf {x}}) - {\textbf {x}} {\textbf {x}}^\top \end{aligned}$$
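
As above, these formulas can be implemented directly. The following sketch mirrors the displayed expressions, using the reduced parameterization with \({\textbf {x}}\in \mathbb {R}^{d-1}\) and the last entry implicit; the function names are illustrative.

import numpy as np

def simplex_nabla_phi(x):
    """Forward entropic mirror map: y_i = log x_i - log x_d, with x_d = 1 - sum(x)."""
    x_d = 1.0 - np.sum(x)
    return np.log(x) - np.log(x_d)

def simplex_nabla_phi_star(y):
    """Inverse map: x_i = exp(y_i) / (1 + sum_j exp(y_j))."""
    e = np.exp(y)
    return e / (1.0 + np.sum(e))

def simplex_inv_hessian(x):
    """Inverse Hessian: diag(x) - x x^T."""
    return np.diag(x) - np.outer(x, x)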

1.4 Appendix 1.4: Details of network architectures

We present the neural network architectures of DirVAE for the image data set MNIST in Table 4 and for the text data set IMDB in Table 5.

Table 4 Network architecture of DirVAE for MNIST
Table 5 Network architecture of DirVAE for IMDB

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Nguyen, D.H., Sakurai, T. Mirror variational transport: a particle-based algorithm for distributional optimization on constrained domains. Mach Learn 112, 2845–2869 (2023). https://doi.org/10.1007/s10994-023-06350-9
