Abstract
Inspired by recent advances in distributed algorithms for approximating Wasserstein barycenters, we propose a novel distributed algorithm for this problem. The main novelty is that we consider time-varying computational networks, which arise, for example, when only a subset of sensors is able to make observations at each time step, and yet the goal is to average signals (e.g., satellite pictures of some area) by approximating their barycenter. We embed this problem into a class of non-smooth dual-friendly distributed optimization problems over time-varying networks and develop a first-order method for this class. We prove non-asymptotic convergence rates, accelerated in the sense of Nesterov, and explicitly characterize their dependence on the parameters of the network and its dynamics. In the experiments, we demonstrate the efficiency of the proposed algorithm when applied to the Wasserstein barycenter problem.
Notes
e.g., one can take \(f_i^{\gamma }(x) = f_i(x) + \frac{\gamma }{2}\Vert x\Vert ^2_2\).
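As a small illustration (our sketch, not part of the paper), this regularization is one line of code for a generic convex \(f_i\):

```python
import numpy as np

def make_strongly_convex(f, gamma):
    """Return f_gamma(x) = f(x) + (gamma/2) * ||x||_2^2, a gamma-strongly convex surrogate."""
    return lambda x: f(x) + 0.5 * gamma * np.dot(x, x)

# Example: the l1-norm is convex but not strongly convex; f_gamma is gamma-strongly convex.
f = lambda x: np.abs(x).sum()
f_gamma = make_strongly_convex(f, gamma=0.1)
print(f_gamma(np.array([1.0, -2.0])))  # 3.0 + 0.05 * 5.0 = 3.25
```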
References
Agueh M, Carlier G (2011) Barycenters in the Wasserstein space. SIAM J Math Anal 43(2):904–924
del Barrio E, Giné E, Matrán C (1999) Central limit theorems for the Wasserstein distance between the empirical and the true distributions. Ann Probab 27(2):1009–1071
Bigot J, Cazelles E, Papadakis N (2019) Data-driven regularization of Wasserstein barycenters with an application to multivariate density registration. Inf. Inference J IMA 8(4):719–755
Bishop AN, Doucet A (2021) Network consensus in the Wasserstein metric space of probability measures. SIAM J Control Optim 59(5):3261–3277
Boissard E, Le Gouic T, Loubes J-M (2015) Distribution’s template estimate with Wasserstein metrics. Bernoulli 21(2):740–759
Cuturi M, Peyré G (2015) A smoothed dual approach for variational Wasserstein problems. SIAM J Imaging Sci 9(1):320–343
Cuturi M, Doucet A (2014) Fast computation of Wasserstein barycenters. In: International Conference on Machine Learning, PMLR, pp 685–693
Devolder O, Glineur F, Nesterov Y (2012) Double smoothing technique for large-scale linearly constrained convex optimization. SIAM J Optim 22(2):702–727
Dvinskikh D (2021) Decentralized algorithms for Wasserstein barycenters. PhD thesis, Humboldt-Universität zu Berlin (Germany)
Dvinskikh D, Gorbunov E, Gasnikov A, Dvurechensky P, Uribe CA (2019) On primal and dual approaches for distributed stochastic convex optimization over networks. In: 2019 IEEE 58th Conference on Decision and Control (CDC), IEEE, pp 7435–7440
Dvinskikh D, Tiapkin D (2021) Improved complexity bounds in Wasserstein barycenter problem. In: Proceedings of the 24th International Conference on Artificial Intelligence and Statistics, PMLR, pp 1738–1746
Dvurechenskii P, Dvinskikh D, Gasnikov A, Uribe C, Nedich A (2018) Decentralize and randomize: faster algorithm for Wasserstein barycenters. Advances in Neural Information Processing Systems 31
Flamary R, Courty N, Gramfort A, Alaya MZ, Boisbunon A, Chambon S, Chapel L, Corenflos A, Fatras K, Fournier N, Gautheron L, Gayraud NTH, Janati H, Rakotomamonjy A, Redko I, Rolet A, Schutz A, Seguy V, Sutherland DJ, Tavenard R, Tong A, Vayer T (2021) Pot: Python optimal transport. J Mach Learn Res 22(78):1–8
Gasnikov AV, Gasnikova E, Nesterov YE, Chernov A (2016) Efficient numerical methods for entropy-linear programming problems. Comput Math Math Phys 56(4):514–524
Gorbunov E, Rogozin A, Beznosikov A, Dvinskikh D, Gasnikov A (2022) Recent theoretical advances in decentralized distributed convex optimization. In: Nikeghbali A, Pardalos PM, Raigorodskii AM, Rassias MT (eds) High-Dimensional Optimization and Probability. Springer, Cham, pp 253–325. https://doi.org/10.1007/978-3-031-00832-0_8
Kantorovich L (1942) On the translocation of masses. C R (Doklady) Acad Sci URSS (N.S.) 37:199–201
Kovalev D, Gasanov E, Gasnikov A, Richtarik P (2021) Lower bounds and optimal algorithms for smooth and strongly convex decentralized optimization over time-varying networks. Advances in Neural Information Processing Systems 34
Kovalev D, Shulgin E, Richtárik P, Rogozin AV, Gasnikov A (2021) ADOM: accelerated decentralized optimization method for time-varying networks. In: International Conference on Machine Learning, PMLR, pp 5784–5793
Krawtschenko R, Uribe CA, Gasnikov A, Dvurechensky P (2020) Distributed optimization with quantization for computing Wasserstein barycenters. arXiv preprint arXiv:2010.14325
Kroshnin A, Tupitsa N, Dvinskikh D, Dvurechensky P, Gasnikov A, Uribe C (2019) On the complexity of approximating Wasserstein barycenters. In: International Conference on Machine Learning, PMLR, pp 3530–3540
LeCun Y (1998) The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/
Lemaréchal C, Sagastizábal C (1997) Practical aspects of the Moreau–Yosida regularization: theoretical preliminaries. SIAM J Optim 7(2):367–385
Li H, Lin Z (2021) Accelerated gradient tracking over time-varying graphs for decentralized optimization. arXiv preprint arXiv:2104.02596
Monge G (1781) Mémoire sur la théorie des déblais et des remblais. Histoire de l’Académie Royale des Sciences de Paris
Peyré G, Cuturi M (2019) Computational optimal transport: with applications to data science. Found Trends Mach Learn 11(5–6):355–607
Rabin J, Peyré G, Delon J, Bernot M (2011) Wasserstein barycenter and its application to texture mixing. In: International Conference on Scale Space and Variational Methods in Computer Vision, Springer, pp 435–446
Rockafellar RT (1997) Convex analysis, vol 11. Princeton University Press, Princeton
Rogozin A, Beznosikov A, Dvinskikh D, Kovalev D, Dvurechensky P, Gasnikov A (2021) Decentralized distributed optimization for saddle point problems. arXiv preprint arXiv:2102.07758
Rogozin A, Bochko M, Dvurechensky P, Gasnikov A, Lukoshkin V (2021) An accelerated method for decentralized distributed stochastic optimization over time-varying graphs. In: 2021 60th IEEE Conference on Decision and Control (CDC), pp 3367–3373. https://doi.org/10.1109/CDC45484.2021.9683110
Staib M, Claici S, Solomon JM, Jegelka S (2017) Parallel streaming Wasserstein barycenters. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in Neural Information Processing Systems 30. Curran Associates, Inc., pp 2647–2658. http://papers.nips.cc/paper/6858-parallel-streaming-wasserstein-barycenters.pdf
Uribe CA, Dvinskikh D, Dvurechensky P, Gasnikov A, Nedić A (2018) Distributed computation of Wasserstein barycenters over networks. In: 2018 IEEE Conference on Decision and Control (CDC), IEEE, pp 6544–6549
Uribe CA, Lee S, Gasnikov A, Nedić A (2020) A dual approach for optimal algorithms in distributed optimization over networks. In: 2020 Information Theory and Applications Workshop (ITA), IEEE, pp 1–37
Villani C (2009) Optimal transport: old and new, vol 338. Springer, Cham
Wu X, Lu J (2019) Fenchel dual gradient methods for distributed convex optimization over time-varying networks. IEEE Trans Autom Control 64(11):4629–4636. https://doi.org/10.1109/TAC.2019.2901829
Acknowledgements
The authors are grateful to Alexander Rogozin for his feedback on the manuscript. This work was supported by a grant for research centers in the field of artificial intelligence, provided by the Analytical Center for the Government of the Russian Federation in accordance with the subsidy agreement (agreement identifier 000000D730321P5Q0002) and the agreement with the Moscow Institute of Physics and Technology dated November 1, 2021 No. 70-2021-00138.
Author information
Contributions
OY carried out the research and wrote all sections except for the introduction and the fourth one, MP performed all the numerical experiments, AG proposed the main ideas and wrote Section 4, PD supervised the research and wrote the introduction, and DK advised on possible methods. All authors reviewed the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
A ADOM and its assumptions
ADOM, the state-of-the-art numerical method for decentralized optimization over time-varying networks, was developed in Kovalev et al. (2021), and this subsection presents its main components. ADOM places natural restrictions on the class of suitable problems; the Wasserstein barycenter problem, for example, lies beyond its requirements. We therefore modify ADOM to solve more general constrained optimization problems. For the sake of consistency, we slightly change the original notation and restate below the results from Kovalev et al. (2021).
In Kovalev et al. (2021), the optimization problem with the consensus condition is
\[ \min _{\textbf{x}\in {\mathcal {R}}}\ H(\textbf{x}) = \sum _{i=1}^m h_i([\textbf{x}]_i), \qquad {\mathcal {R}}= \{\textbf{x}\in ({\mathbb {R}}^d)^m \mid [\textbf{x}]_1 = \cdots = [\textbf{x}]_m\}, \qquad (10) \]
where the functions \(h_i:{\mathbb {R}}^d \rightarrow {\mathbb {R}}\) are assumed to be smooth and strongly convex. Problem (10) is equivalent to the following dual problem:
\[ \min _{\textbf{z}\in {\mathcal {R}}^{\perp }} H^*(\textbf{z}), \qquad (11) \]
where \(H^*\) is the Fenchel transform of the function H (well defined since \(S = {\mathbb {R}}^d\) here) and \({\mathcal {R}}^{\perp }\) is the orthogonal complement of \({\mathcal {R}}\).
Theorem 3
(Kovalev et al. 2021, Theorem 1) Let the functions \(h_i:{\mathbb {R}}^d \rightarrow {\mathbb {R}}\) be L smooth and \(\mu\) strongly convex, let \(\textbf{x}^*\) be the solution of the optimization problem (10), and let \(\textbf{W}_n\) be a communication matrix at the n-th iteration satisfying Assumption 1. Set the parameters \(\alpha , \eta , \theta , \sigma ,\tau\) of Algorithm 2 to \(\alpha = \frac{1}{2L}\), \(\eta = \frac{2\lambda _{\min }^{+}\sqrt{\mu L}}{7\lambda _{\max }}\), \(\theta = \frac{\mu }{\lambda _{\max }}\), \(\sigma = \frac{1}{\lambda _{\max }}\), and \(\tau = \frac{\lambda _{\min }^{+}}{7\lambda _{\max }}\sqrt{\frac{\mu }{L}}\). Then there exists \(C>0\) such that for the Fenchel conjugate function \(H^*(\textbf{z})\) from (11)
\[ \Vert \nabla H^*(\textbf{z}_g^n) - \textbf{x}^*\Vert _2^2 \le C\left( 1-\frac{\lambda _{\min }^{+}}{7\lambda _{\max }}\sqrt{\frac{\mu }{L}}\right) ^{n}. \]
Remark 2
Examining the details of the proof of Theorem 1 in Kovalev et al. (2021), we see that there is a particular choice of the constant C, namely
This means that the resulting iteration complexity is \(n = {\mathcal {O}}\left( \frac{\lambda _{\max }}{\lambda _{\min }^{+}}\sqrt{\frac{L}{\mu }}\ln \frac{1}{\mu ^2\varepsilon }\right)\).
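For intuition, the bound can be evaluated numerically; a hypothetical helper (our sketch, with illustrative inputs) under the stated parameter choices:

```python
import math

def adom_iterations(lam_max, lam_min_plus, L, mu, eps):
    """Iteration bound n = O((lam_max/lam_min_plus) * sqrt(L/mu) * ln(1/(mu^2 eps)))."""
    kappa_net = lam_max / lam_min_plus      # network condition number
    kappa_fun = math.sqrt(L / mu)           # function condition number
    return math.ceil(kappa_net * kappa_fun * math.log(1.0 / (mu**2 * eps)))

# e.g., a network with lam_max/lam_min_plus = 10 and L/mu = 100:
print(adom_iterations(lam_max=10.0, lam_min_plus=1.0, L=100.0, mu=1.0, eps=1e-3))
```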
B Proof of Theorem 1
All the arguments below are applied under the assumptions of Theorem 1, i.e., we assume that \(S\subset {\mathbb {R}}^d\) is a convex set, that \(\textbf{x}\in {\mathcal {S}}\) is equivalent to \([\textbf{x}]_i\in S\) for all \(i=1,\ldots ,m\), that the functions \(f^{\gamma }_i:S\rightarrow {\mathbb {R}}\) are \(\gamma\) strongly convex, and that the output of Algorithm 1 is \(\textbf{x}^n_{r,\gamma } = \nabla (H^{r,\gamma })^*(\textbf{z}_g^n)\). Denote also
B.1 Derivation of \((H^{r,\gamma })^*\)
In brief, in this subsection we show that the functions \(h_i^{r,\gamma }\) from (14) are \(\frac{1}{r}\) smooth and \(\frac{\gamma }{1+r\gamma }\) strongly convex, and that \(\nabla (H^{r,\gamma })^*\) from Line 3 of Algorithm 1 is indeed the gradient of the conjugate function \((H^{r,\gamma })^*\) of \(H^{r,\gamma } = \sum \nolimits _{i=1}^m h_i^{r,\gamma }\) from (14). The consensus condition (4) then becomes a corollary of Theorem 3 with \(L = \frac{1}{r}\) and \(\mu = \frac{\gamma }{1+r\gamma }\).
From now on let the functions \(h_i^{r,\gamma }:{\mathbb {R}}^d \rightarrow {\mathbb {R}}\) and \(H^{r,\gamma }:({\mathbb {R}}^d)^m \rightarrow {\mathbb {R}}\) be
\[ h_i^{r,\gamma }(x) = \min _{y\in S}\left( f_i^{\gamma }(y) + \frac{1}{2r}\Vert x-y\Vert _2^2\right) , \qquad H^{r,\gamma }(\textbf{x}) = \sum _{i=1}^m h_i^{r,\gamma }([\textbf{x}]_i). \qquad (14) \]
Denote their conjugates by \((h_i^{r,\gamma })^*\) and \((H^{r,\gamma })^*\).
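Since (14) is an inner minimization, it can be evaluated numerically; a one-dimensional sketch (ours, with \(S=[0,1]\) and a toy \(f^{\gamma }\)), assuming SciPy is available:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def moreau_envelope(f, x, r, S=(0.0, 1.0)):
    """h^{r,gamma}(x) = min_{y in S} f(y) + ||x - y||^2 / (2r), cf. definition (14), 1-D case."""
    res = minimize_scalar(lambda y: f(y) + (x - y) ** 2 / (2 * r), bounds=S, method="bounded")
    return res.fun

gamma, r = 0.5, 0.1
f = lambda y: abs(y - 0.3) + 0.5 * gamma * y**2   # gamma-strongly convex, non-smooth
xs = np.linspace(-1.0, 2.0, 7)
print([round(moreau_envelope(f, x, r), 4) for x in xs])  # smooth envelope values
```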
Lemma 1
If the functions \(h_i^{r,\gamma }\) and \(H^{r,\gamma }\) are defined by (14), then their Fenchel conjugate functions \((h_i^{r,\gamma })^*\) and \((H^{r,\gamma })^*:({\mathbb {R}}^d)^m \rightarrow {\mathbb {R}}\) are
\[ (h_i^{r,\gamma })^*(z) = (f_i^{\gamma })^*(z) + \frac{r}{2}\Vert z\Vert ^2_2, \qquad (H^{r,\gamma })^*(\textbf{z}) = \sum _{i=1}^m (h_i^{r,\gamma })^*([\textbf{z}]_i). \]
Moreover, the conjugate \((H^{r,\gamma })^{**}\) of \((H^{r,\gamma })^{*}\) coincides with \(H^{r,\gamma }\).
Proof
The definition (14) is similar to Moreau–Yosida smoothing, but the subtle point is that the functions \(f_i^{\gamma }\) are defined on a convex set S instead of the whole \({\mathbb {R}}^d\). Let us introduce functions \({\tilde{f}}_i^{\gamma }\) with domain \({\mathbb {R}}^d\) as follows:
\[ {\tilde{f}}_i^{\gamma }(x) = {\left\{ \begin{array}{ll} f_i^{\gamma }(x), &{} x\in S,\\ +\infty , &{} x\notin S. \end{array}\right. } \]
Such \({\tilde{f}}_i^{\gamma }\) are \(\gamma\) strongly convex as well. Moreover, substituting \({\tilde{f}}_i^{\gamma }\) for \(f_i^{\gamma }\) affects neither the primal \(h_i^{r,\gamma }\):
nor \((f_i^{\gamma })^*(z) + \frac{r}{2}\Vert z\Vert ^2_2\):
For each i one can see that \((h_i^{r,\gamma })^*(z) = (f^{\gamma }_i)^*(z) + \frac{r}{2}\Vert z\Vert ^2_2\) is the Fenchel conjugate of \(h_i^{r,\gamma }\) and vice versa. Indeed, for proper, convex, and lower semicontinuous \(g_1, g_2:{\mathbb {R}}^d \rightarrow {\mathbb {R}}\) we have \((g_1+ g_2)^* = g_1^* \square g_2^*\) and \((g_1 \square g_2)^* = g_1^* + g_2^*\), where \((g_1\square g_2)(x)\) denotes the infimal convolution \(\inf \{g_1(y) + g_2(x-y) \mid y\in {\mathbb {R}}^d \}\); here we take \(g_2 = \frac{1}{2r}\Vert \cdot \Vert ^2_2\), whose conjugate is \(\frac{r}{2}\Vert \cdot \Vert ^2_2\).
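These conjugacy identities lend themselves to a numerical sanity check; the following sketch (ours, 1-D, with \(S=[0,1]\) and grids standing in for \({\mathbb {R}}\)) compares \((h^{r,\gamma })^*\) computed directly against \((f^{\gamma })^* + \frac{r}{2}\Vert \cdot \Vert _2^2\):

```python
import numpy as np

gamma, r = 0.5, 0.1
ys = np.linspace(0.0, 1.0, 2001)                     # grid over S
f = np.abs(ys - 0.3) + 0.5 * gamma * ys**2           # f^gamma on S
f_conj = lambda z: np.max(z * ys - f)                # (f^gamma)*(z), discrete sup

xs = np.linspace(-2.0, 2.0, 4001)                    # truncated grid over R
h = np.array([np.min(f + (x - ys) ** 2 / (2 * r)) for x in xs])   # h^{r,gamma}
h_conj = lambda z: np.max(z * xs - h)                # (h^{r,gamma})*(z)

for z in np.linspace(-2.0, 2.0, 9):
    print(round(h_conj(z), 3), round(f_conj(z) + 0.5 * r * z**2, 3))  # should agree
```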
Hence the Fenchel conjugate of the function \(H^{r,\gamma }\) is
In the same way one can see that \(H^{r,\gamma }\) and \((H^{r,\gamma })^{**}\) coincide. \(\square\)
Remark 3
For each i the function \(\left( h^{r,\gamma }_{i}\right) ^*\) from (14) is \(\left( \frac{1}{\gamma }+r\right)\) smooth and r strongly convex by construction, so \(h^{r,\gamma }_i = (h^{r,\gamma }_i)^{**}\) is \(\frac{1}{r}\) smooth and \(\frac{\gamma }{1+r\gamma }\) strongly convex. In addition,
\[ \nabla (h^{r,\gamma }_i)^*(z) = \nabla (f^{\gamma }_i)^*(z) + rz, \]
as stated in Line 3 of Algorithm 1. Then we can apply Algorithm 2 to the \(L = r^{-1}\) smooth and \(\mu = \frac{\gamma }{1+r\gamma }\) strongly convex functions \(h^{r,\gamma }_i\) and get the values of \(\nabla (h^{r,\gamma }_i)^*(z)\) as output.
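Thus the dual oracle in Line 3 decomposes into a constrained maximization plus a linear shift; a plausible per-node realization (our sketch; the box constraint \(S=[0,1]^d\) and the solver are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import minimize

def grad_f_conj(f_gamma, z, d):
    """grad (f^gamma)*(z) = argmax_{x in S} <z, x> - f_gamma(x); here S = [0, 1]^d."""
    res = minimize(lambda x: f_gamma(x) - z @ x, x0=np.full(d, 0.5),
                   bounds=[(0.0, 1.0)] * d)  # L-BFGS-B under the hood
    return res.x

def grad_h_conj(f_gamma, z, d, r):
    """grad (h^{r,gamma})*(z) = grad (f^gamma)*(z) + r z, by Lemma 1."""
    return grad_f_conj(f_gamma, z, d) + r * z
```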
Thus we construct a relaxation \(\min _{\textbf{x}\in {\mathcal {R}}}H^{r,\gamma }(\textbf{x})\) of the constrained convex optimization problem \(\min _{\textbf{x}\in {\mathcal {S}}}F^{\gamma }(\textbf{x})\).
Corollary 2
Let the function \(H^{r,\gamma }\) be defined in (14) and let \(\textbf{x}^*_{r,\gamma } = \mathop {\mathrm {arg\,min}}\nolimits \nolimits _{\textbf{x}\in {\mathcal {R}}}H^{r,\gamma }(\textbf{x})\). Then, applying Algorithm 2 for
we get by Theorem 3
where \(\textbf{x}_{r,\gamma }^n = \nabla (H^{r,\gamma })^*(\textbf{z}_g^n)\) and
Moreover, since \(\textbf{x}^*_{r,\gamma } \in {\mathcal {R}}\), i.e. \([\textbf{x}^*_{r,\gamma }]_i = [\textbf{x}^*_{r,\gamma }]_j\) for all i and j, the consensus condition is approximated as follows
B.2 Value bounds on \(H^{r,\gamma }\)
Although we defined \(h_i^{r,\gamma }\) on the whole \({\mathbb {R}}^d\), some properties hold true on the initial set S only.
Lemma 2
Let the functions \(h_i^{r,\gamma }\) be defined in (14). If \(x\in S\), then for any \(r>0\) and each \(i = 1,\ldots ,m\) we have
Proof
The second inequality follows directly from the definition (14). To prove the first one, we recall that \(f^{\gamma }_i\) is \(\gamma\) strongly convex and that the following holds:
which attains its minimum at \(y=\frac{r}{1+r\gamma }\nabla f^{\gamma }_i(x)\) and thus equals
\(\square\)
B.3 Convergence in argument
Lemma 3 shows convergence in argument in the following sense: as the regularization parameter r tends to zero, the argminimum \(\textbf{x}^*_{r,\gamma }\in {\mathcal {R}}\) of \(H^{r,\gamma }\) tends to the argminimum \(\textbf{x}^*_{\gamma }\in {\mathcal {S}}\) of \(F^{\gamma }\). By Corollary 2, \(\textbf{x}^*_{r,\gamma }\in {\mathcal {R}}\) is approximated by \(\textbf{x}^n_{r,\gamma }\in ({\mathbb {R}}^d )^m\) after a sufficient number of iterations n.
Lemma 3
Let \(\textbf{x}^*_{r,\gamma } = \mathop {\mathrm {arg\,min}}\limits _{\textbf{x}\in {\mathcal {R}}} H^{r,\gamma }(\textbf{x})\) for \(H^{r,\gamma }\) defined in (14). Let
If r is such that \(\Vert \textbf{x}^*_{r,\gamma }- \textbf{x}^*_{\gamma }\Vert _2\le \zeta\), then
Proof
Using (18) and strong convexity of \(F^{\gamma }\) and \(H^{r,\gamma }\) we have
Then \(\frac{\gamma }{1+r\gamma } \Vert \textbf{x}^*_{r,\gamma }- \textbf{x}^*_{\gamma }\Vert _2^2 - \frac{r}{2(1+r\gamma )}mK_{\zeta }^2\le 0\) and hence \(\Vert \textbf{x}^*_{r,\gamma }- \textbf{x}^*_{\gamma }\Vert ^2_2 \le \frac{rm}{2\gamma }K^2_{\zeta }\). \(\square\)
Combining Lemma 3 with Corollary 2 we get the following.
Remark 4
Let \(\zeta >0\) and let \(K_{\zeta }\) be such that (19) holds. If
where \(C_1 = \frac{(1+r\gamma )^2}{2\gamma ^2}\), then both \(\Vert \textbf{x}^*_{r,\gamma }- \textbf{x}^*_{\gamma }\Vert _2\le \zeta\) and \(\Vert \textbf{x}^n_{r,\gamma }- \textbf{x}^*_{\gamma }\Vert _2\le \zeta\) hold.
B.4 Value approximation
Let \(\textbf{x}^*_{r,\gamma }\in {\mathcal {R}}\) be the unique argminimum of \(H^{r,\gamma }\) on the consensus space \({\mathcal {R}}\), i.e.
In order to prove the value approximation (5) let us separate it into parts and estimate each of them:
The last addend is negative and can be eliminated:
The rest are estimated in Lemmas 4 and 5 under additional assumptions.
Lemma 4
Let \(\Vert \textbf{x}^n_{r,\gamma }- \textbf{x}^*_{\gamma }\Vert _2\le \zeta\). If (19) holds, then
Proof
We cannot use a uniform K instead of \(K_{\zeta }\) because \(F^{\gamma }\) is not smooth. Nonetheless, assuming that \(\textbf{x}^{n}_{r,\gamma }\) belongs to the \(\zeta\)-neighborhood of \(\textbf{x}^*_{\gamma }\), we immediately obtain from (18) and (19) that
\(\square\)
Lemma 5
Let (19) hold. Then
Proof
By the \(\frac{m}{r}\) smoothness of \(H^{r,\gamma }\),
where \(\textbf{z}^{\infty }_{g}\) is the limit of \(\textbf{z}_g^n\) and hence the argminimum of \((H^{r,\gamma })^*\) on \({\mathcal {R}}^{\perp }\). By (17) we have
Let us introduce the orthogonal projection matrix \(\textbf{P}\) onto the subspace \({\mathcal {R}}^{\perp }\), i.e., \(\textbf{P}v = \mathop {\mathrm {arg\,min}}\limits _{z \in {\mathcal {R}}^{\perp }} \Vert v - z\Vert _2\) for an arbitrary \(v \in ({\mathbb {R}}^d)^m\). Then the matrix \(\textbf{P}\) is
\[ \textbf{P}= \left( \textbf{I}_m - \tfrac{1}{m}\textbf{1}_m\textbf{1}_m^{\top }\right) \otimes \textbf{I}_d, \]
where \(\textbf{I}_m\) denotes the \(m\times m\) identity matrix, \(\textbf{1}_m = (1,\ldots ,1)^{\top }\in {\mathbb {R}}^m\), and \(\otimes\) is the Kronecker product. Note that \(\textbf{P}^{\top }\textbf{P}= \textbf{P}\).
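In code, \(\textbf{P}\) is one Kronecker product away; a small check (our sketch) of idempotence and of the fact that \(\textbf{P}\) annihilates consensus vectors:

```python
import numpy as np

m, d = 4, 3
P = np.kron(np.eye(m) - np.ones((m, m)) / m, np.eye(d))  # projector onto R^perp

assert np.allclose(P.T @ P, P)            # P^T P = P (symmetric idempotent)
x = np.tile(np.random.randn(d), m)        # consensus vector: m stacked copies
assert np.allclose(P @ x, 0.0)            # consensus vectors lie in ker(P)
```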
Since \(\textbf{z}_g^{\infty }\in {\mathcal {R}}^{\perp }\) and \(\textbf{x}^{*}_{r,\gamma }\in {\mathcal {R}}\), the first part simplifies to \(\langle \textbf{z}_g^{\infty }, \textbf{P}\nabla (H^{r,\gamma })^*(\textbf{z}_g^{n}) \rangle\). We may use Lemma 2 in Kovalev et al. (2021) to get the following estimate:
As \(\textbf{z}^{n+1}_f\) is a non-optimal point of Algorithm 1, this is not greater than
and the latter ones follow from the \(\frac{m(1+r\gamma )}{\gamma }\) smoothness of \((H^{r,\gamma })^*\) and from the fact that the proof of (Kovalev et al. 2021, Theorem 1) actually covers the following chain of inequalities:
By our assumption \(\Vert \textbf{z}^{\infty }_{g}\Vert _2 = \Vert \nabla H^{r,\gamma }(x_{r,\gamma }^*)\Vert _2< \sqrt{m}K_{\zeta }\). Thus, we obtain
\(\square\)
B.5 Final compilation
This section completes the proof of Theorem 1 and establishes Remark 1.
Recall that \(C_1 = \frac{(1+r\gamma )^2}{2\gamma }\) and
By Remark 4 and Lemmas 4, 5 we see that \(F^{\gamma }(\textbf{x}^{n}_{r,\gamma }) - F^{\gamma }(\textbf{x}^*_{\gamma })< \varepsilon\) if
Let \(\zeta =\sqrt{\varepsilon / \gamma }\) and let \(r\le \frac{\varepsilon }{2mK^2_{\zeta }}\); then (27) holds. If (28) is fulfilled, then (26) follows from (27) and (28), since \(\sqrt{\frac{rm}{2\gamma }}K_{\zeta }\le \sqrt{\frac{\varepsilon }{2\gamma }} \le \zeta /\sqrt{2}\) and \(\sqrt{C_1}\left( 1-\frac{\lambda _{\min }^{+}}{7\lambda _{\max }}\sqrt{\frac{r\gamma }{1+r\gamma }}\right) ^{n/2}\le \zeta /2\), because \(1\le \sqrt{C_1}\le C_1\le C_2\) and \(\varepsilon \le \sqrt{\varepsilon /\gamma } = \zeta\). Thus, it suffices to assume
Hence an \(\varepsilon\)-approximation requires the following number of iterations:
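Collecting the parameter choices of this subsection, a hypothetical helper (our sketch; the value of \(K_{\zeta }\) must come from Proposition 1 or another problem-specific bound):

```python
import math

def choose_parameters(eps, gamma, m, K_zeta):
    """Pick zeta and r as in the final compilation: zeta = sqrt(eps/gamma), r <= eps/(2 m K^2)."""
    zeta = math.sqrt(eps / gamma)
    r = eps / (2 * m * K_zeta**2)
    mu = gamma / (1 + r * gamma)          # strong convexity of h^{r,gamma}
    L = 1.0 / r                           # smoothness of h^{r,gamma}
    return zeta, r, mu, L

print(choose_parameters(eps=1e-2, gamma=0.1, m=10, K_zeta=1.0))
```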
C Proof of Theorem 2
To prove Theorem 2, we combine Theorem 1, proved above, with properties of the entropic regularization of the Wasserstein barycenter problem.
C.1 Entropy regularized WB problem
Recall that for a fixed cost matrix M we define the set of transport plans between \(p, q \in S_1(d)\) as
\[ U(p,q) = \left\{ X\in {\mathbb {R}}^{d\times d}_{+} \mid X\textbf{1}= p, \ X^{\top }\textbf{1}= q \right\} \]
and the Wasserstein distance between two probability distributions p and q as
\[ {\mathcal {W}}(p,q) = \min _{X\in U(p,q)} \langle M, X \rangle . \]
The entropy regularized (or smoothed) Wasserstein distance is defined as
\[ {\mathcal {W}}_{\gamma }(p,q) = \min _{X\in U(p,q)} \left( \langle M, X \rangle - \gamma E(X) \right) , \qquad (29) \]
where \(\gamma >0\) and \(E(X) = -\sum \nolimits _{i,j=1}^{d} X_{ij}\ln X_{ij}\) is the entropy of the transport plan.
So it seeks to minimize the transportation costs while maximizing the entropy. Moreover, \({\mathcal {W}}_{\gamma }(p,q)\rightarrow {\mathcal {W}}(p,q)\) as \(\gamma \rightarrow 0\).
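Numerically, \({\mathcal {W}}_{\gamma }\) is computed by Sinkhorn-type iterations; a minimal sketch with the POT library (Flamary et al. 2021), on toy histograms of our choosing (note that POT's entropy normalization may differ from (29) by a constant):

```python
import numpy as np
import ot  # POT: Python Optimal Transport (Flamary et al. 2021)

d = 16
xs = np.arange(d, dtype=float)
M = (xs[:, None] - xs[None, :]) ** 2
M /= M.max()                               # normalized squared-distance cost

p = np.ones(d) / d                         # uniform histogram
q = np.exp(-((xs - d / 2) ** 2) / 8.0)
q /= q.sum()                               # Gaussian-like histogram

w_exact = ot.emd2(p, q, M)                 # unregularized W(p, q)
for gamma in (1.0, 0.1, 0.02):
    print(gamma, ot.sinkhorn2(p, q, M, reg=gamma), w_exact)  # W_gamma -> W as gamma -> 0
```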
Then the convex optimization problem (7) can be relaxed to the following \(\gamma\) strongly convex optimization problem:
\[ \min _{p\in S_1(d)} \sum _{i=1}^{m} {\mathcal {W}}_{\gamma , q_i}(p), \qquad (31) \]
where \({\mathcal {W}}_{\gamma ,q_i}(p) = {\mathcal {W}}_{\gamma }(q_i, p)\). The argminimum of (31) is called the uniform Wasserstein barycenter (Agueh and Carlier 2011; Cuturi and Doucet 2014) of the family \(q_1,\ldots , q_m\). Moreover, problem (31) admits a unique solution and approximates the unregularized WB problem as follows.
Remark 5
Let \(\gamma \le \frac{\varepsilon }{4\ln d}\). If vectors \(\hat{p}_i\in S_1(d)\) are such that
then
Indeed, as the entropy is bounded, we have \({\mathcal {W}}_{q_i}(p)\le {\mathcal {W}}_{\gamma , q_i}(p)\le {\mathcal {W}}_{q_i}(p) + 2\gamma \ln d\) for all i and p. Then, for \(p^* = \mathop {\mathrm {arg\,min}}\nolimits \nolimits _{p\in S_1(d)}\sum \nolimits _{i=1}^{m}{\mathcal {W}}_{q_i}(p)\) and \(p^*_{\gamma } = \mathop {\mathrm {arg\,min}}\nolimits \nolimits _{p\in S_1(d)}\sum \nolimits _{i=1}^{m}{\mathcal {W}}_{\gamma , q_i}(p)\) it holds that
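For reference, the centralized counterpart of problem (31), which the proposed method distributes across nodes, can be computed with POT's entropic barycenter routine; a minimal sketch under the same toy setup as above:

```python
import numpy as np
import ot

d, m = 16, 3
xs = np.arange(d, dtype=float)
M = (xs[:, None] - xs[None, :]) ** 2
M /= M.max()

# m Gaussian-like histograms q_1, ..., q_m (columns of A)
A = np.stack([np.exp(-((xs - c) ** 2) / 6.0) for c in (4.0, 8.0, 12.0)], axis=1)
A /= A.sum(axis=0, keepdims=True)

p_gamma = ot.bregman.barycenter(A, M, reg=0.05)   # uniform weights by default
print(p_gamma.round(3))                            # entropic barycenter of q_1..q_m
```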
C.2 Legendre transforms
One particular advantage of the entropic regularization of the Wasserstein distance is that it yields closed-form representations for the dual function \({\mathcal {W}}^*_{\gamma , q}(\cdot )\) and for its gradient. Recall that the Fenchel–Legendre transform of (29) is defined as
\[ {\mathcal {W}}^*_{\gamma ,q}(z) = \sup _{p\in S_1(d)} \left( \langle z, p \rangle - {\mathcal {W}}_{\gamma ,q}(p) \right) . \]
Theorem 4
(Cuturi and Peyré 2015, Theorem 2.4) For \(\gamma >0\), the Fenchel–Legendre dual function \({\mathcal {W}}^*_{\gamma ,q}(z)\) is differentiable and its gradient \(\nabla {\mathcal {W}}^*_{\gamma ,q}(z)\) is \(1/\gamma\)-Lipschitz in the 2-norm, with
\[ \nabla {\mathcal {W}}^*_{\gamma ,q}(z) = \alpha \odot \left( {\mathcal {K}} \left( q \oslash ({\mathcal {K}}^{\top }\alpha ) \right) \right) , \]
where \(z \in {\mathbb {R}}^d\), \(\odot\) and \(\oslash\) denote element-wise multiplication and division, and for brevity we denote \(\alpha = \exp ( {z}/{\gamma })\) and \({\mathcal {K}}= \exp \left( {-M}/{\gamma }\right)\) (both element-wise).
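In this notation the gradient is two matrix-vector products; a direct sketch (ours), with the caveat that for small \(\gamma\) these products should be carried out in the log domain to avoid underflow:

```python
import numpy as np

def grad_W_conj(z, q, M, gamma):
    """grad W*_{gamma,q}(z) = alpha * (K (q / (K^T alpha))), * and / element-wise."""
    alpha = np.exp(z / gamma)
    K = np.exp(-M / gamma)
    return alpha * (K @ (q / (K.T @ alpha)))

# The output is a probability vector: summing it gives <1, q> = 1.
d = 8
M = np.abs(np.arange(d)[:, None] - np.arange(d)[None, :]) / d
q = np.ones(d) / d
g = grad_W_conj(np.zeros(d), q, M, gamma=0.5)
print(g.sum())  # ~= 1
```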
Notice that to recover the approximated barycenter we can employ the following result (with \(\lambda _i = 1\)).
Theorem 5
(Cuturi and Peyré 2015, Theorem 3.1) The barycenter \(p^*\) solving (31) satisfies
where the set of \(z^*_i\) constitutes a solution of the smoothed dual WB problem:
Thus we can apply Theorem 1 to problem (31) with the explicitly defined \(\nabla {\mathcal {W}}^*_{\gamma , q_i}\) and obtain \(\textbf{x}^n_{r,\gamma }\) satisfying
By Remark 5, this proves
for \(C=\frac{1}{2}C_2 = \frac{(1+r\gamma )mK_{\zeta }}{2\sqrt{2}\gamma }\sqrt{\frac{\lambda _{\max }}{\lambda _{\min }^{+}}} + \frac{(1+r\gamma )^2}{8r\gamma ^2}\).
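Concretely, each node's dual oracle in Algorithm 1 then combines the closed form of Theorem 4 with the regularization shift from Remark 3; a hypothetical per-node routine (our sketch):

```python
import numpy as np

def node_dual_gradient(z, q, M, gamma, r):
    """Per-node oracle: grad (h_i^{r,gamma})*(z) = grad W*_{gamma,q_i}(z) + r z."""
    alpha = np.exp(z / gamma)
    K = np.exp(-M / gamma)
    return alpha * (K @ (q / (K.T @ alpha))) + r * z
```

Once approximate consensus is reached, any node's output (or the average over nodes) serves as the barycenter estimate \(\textbf{x}^n_{r,\gamma }\).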
C.3 Parameter estimation
It remains to choose \(\zeta >0\) and \(K = K_{\zeta }\) satisfying (25). Due to Assumption 2, such \(\zeta\) and K exist.
Proposition 1
Let a set \(\{q_i\}_{i=1}^m\) satisfy Assumption 2, let \(p^*_{\gamma }\) be the uniform Wasserstein barycenter of \(\{q_i\}_{i=1}^m\), and let \(\zeta \in \left( 0, \min \{\frac{1}{e}, \ \min _{i,l} [q_i]_l\}\right)\). For each \(i = 1,\ldots ,m\), the norm of the gradient \(\Vert \nabla {\mathcal {W}}_{\gamma ,q_i}(\cdot )\Vert _2^2\) is uniformly bounded over \(\{p\in S_1(d)\mid \Vert p-p^*_{\gamma } \Vert _2^2 \le \zeta \}\), and the bound \(K_{\rho }\) is given in (35) for \(\rho \le \min \{\frac{1}{e}, \ \min _{i,l} [q_i]_l\}-\zeta\).
We obtain Proposition 1 by combining Lemma 6, due to Bigot et al. (2019), with Lemma 7, which we prove below.
Lemma 6
(Bigot et al. 2019, Lemma 3.5) For any \(\rho \in (0,1)\), \(q\in S_1(d)\), and \(p\in \{x\in S_1(d)\mid \min _l x_l\ge \rho \}\) there is a bound \(\Vert \nabla {\mathcal {W}}_{\gamma , q}(p) \Vert ^2_2\le K_{\rho }\), where
Lemma 7
Let a set \(\{q_i\}_{i=1}^m\) satisfy Assumption 2 and let \(p^*_{\gamma }\) be the uniform Wasserstein barycenter of \(\{q_i\}_{i=1}^m\). Then all components k of \(p^*_{\gamma }\) have a uniform positive lower bound: \([p^*_{\gamma }]_k\ge \min \{\frac{1}{e}, \ \min _{i,l} [q_i]_l\}\).
Proof
Let \(X^*_i\) denote the optimal transport plan between \(p^*_{\gamma }\) and \(q_i\). Assume the contrary: there is a component k such that \([p^*_{\gamma }]_k < \min \{\frac{1}{e}, \ \min _{i,l} [q_i]_l\}\). Then there is another component n such that \([p^*_{\gamma }]_n>\min _i [q_i]_n \ge \min _{i,l} [q_i]_l\). Consider the vector p that coincides with \(p^*_{\gamma }\) except for the components \([p]_k = [p^*_{\gamma }]_k+\delta\) and \([p]_n = [p^*_{\gamma }]_n-\delta\), where \(\delta >0\) is less than the minimal off-diagonal element \(\min _{i,a\not =b}[X_i^*]_{a,b}\) of the optimal transport plans \(X^*_i\) between \(p^*_{\gamma }\) and \(q_i\). Because of the entropy, all these optimal transport plans contain only positive off-diagonal elements, so such a \(\delta\) exists.
Now construct non-optimal transport plans between p and each of the \(q_i\) in order to obtain a contradiction. Initially we have \({\mathcal {W}}_{\gamma ,q_i}(p^*_{\gamma }) = \langle M, X^*_i \rangle - \gamma E(X^*_i)\). Consider the matrix \(X_i\) that differs from \(X^*_i\) only at four elements:
Then \(X_i\) is a transport plan between p and \(q_i\), since its elements are positive, \(X_i \textbf{1}= p\), and \(X_i^{\top }\textbf{1}= q_i\). Using the monotonicity of the entropy function \(-x\ln x\) on the interval \((0,\frac{1}{e})\) (its derivative \(-\ln x - 1\) is positive there) and the assumption that the diagonal elements of the cost matrix M are zero, we get for each i:
The obtained inequalities \({\mathcal {W}}_{\gamma ,q_i}(p)<{\mathcal {W}}_{\gamma ,q_i}(p^*_{\gamma })\) contradict the fact that \(p^*_{\gamma }\) is the barycenter, which proves the lemma. \(\square\)
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yufereva, O., Persiianov, M., Dvurechensky, P. et al. Decentralized convex optimization on time-varying networks with application to Wasserstein barycenters. Comput Manag Sci 21, 12 (2024). https://doi.org/10.1007/s10287-023-00493-9