1 Introduction

Wasserstein barycenters are an increasingly popular application of optimal transport in data science [1, 2]. They have nice mathematical properties, since they are the Fréchet means with respect to the Wasserstein distance [3,4,5]. Their applications range from mixing textures [6, 7], stippling patterns and bidirectional reflectance distribution functions [8], and color distributions and shapes [9], over the averaging of sensor data [10], to Bayesian statistics [11], to name just a few. For further reading, we refer to the surveys [12, 13].

Unfortunately, Wasserstein barycenters are in general hard to compute [14]. Many algorithms restrict the support of the solution to a fixed set and minimize only over the weights. Such methods include projected subgradient [15], iterative Bregman projections [16], (proximal) algorithms based on the latter [17], interior point methods [18], Gauss-Seidel based alternating direction method of multipliers [19], multi-marginal Sinkhorn algorithms and their accelerated variants [20], debiased Sinkhorn barycenter algorithms [21], methods using the Wasserstein distance on a tree [22], accelerated Bregman projections [23] and methods based on mirror proximal maps or on a dual extrapolation scheme [24], among others. While iterative Bregman projections are a standard benchmark that is hard to beat in terms of simplicity and speed, fixed-support methods applied on a grid suffer from the curse of dimensionality.

On the other hand, barycenters without such restriction are called free-support barycenters. This approach can overcome the curse of dimensionality, since the optimal solution is sparse. Free-support barycenters can be computed directly from the solution of the closely related multi-marginal optimal transport (MOT) problem. The latter was originally introduced in [25] in the continuous setting for squared Euclidean costs and further generalized in various ways, e.g., to entropy regularized [26, 27] and unbalanced variants with non-exact marginal constraints [28]. The solution of MOT can be obtained by solving a linear program (LP), which, however, scales exponentially in N [29]. Although there are exact polynomial-time methods for measures on \({\mathbb {R}}^d\) for fixed d [30], see also LP-based methods in [29, 31, 32], these are not necessarily fast in practice and rather involved to implement. A remedy is to resort to approximative approaches, which so far include a Newton approach that iteratively alternates between optimizing over the weights and supports [15], another LP-based method [33], an inexact proximal alternating minimization method [34], an iterative stochastic algorithm [35] and the iterative swapping algorithm [2]. A free-support barycenter method based on the Frank–Wolfe algorithm is given in [36]. The method in [37] computes continuous barycenters using a different parameterization. For approaches for MOT similar to this paper, see [38]. Further speedups can be obtained by subsampling the given measures [39] or dimensionality reduction of the support point clouds [40].

Despite the plethora of literature, many algorithms with low theoretical computational complexity or high-accuracy solutions are rather involved to implement. This impedes their actual usage and further research in practice. To the best of our knowledge, there does not exist an algorithm that fulfills the following list of desiderata in the free-support setting:

  • simple to implement,

  • sharp theoretical error bounds,

  • sparse solutions, and

  • good numerical results in practice.

The purpose of this paper is to show that all of these points can be achieved using one iteration of a simple, well-known fixed-point algorithm, which only requires some off-the-shelf two-marginal OT solver as ingredient to its otherwise easy implementation. Here we consider the cases \(p=2\) and \(p=1\), where the latter has received less attention in the literature so far. One such fixed-point iteration consists in computing optimal transport plans from a given measure to the input measures and pushing each atom to the p-barycenter of its target locations. For the cost of \(N-1\) OT plans, this yields a relative error bound of N; averaging over these results yields a 2-approximation, which requires solving \(N(N-1)/2\) OT problems. The key to these theoretical bounds is the observation that they are already fulfilled for the input measures or their mixture, respectively, which we choose as initialization. On the other hand, we show that the aforementioned fixed-point iteration is guaranteed to at least retain the current approximation quality, and it improves it considerably in practice in the first step.

Note that other algorithms with an upper error bound of 2 have been proposed in [33] for \(p=2\). The basic algorithm produces a barycenter with support \(\cup _{i=1}^N\text {supp}(\mu ^i)\) by solving an LP over its weights. However, while this support choice leads to bad approximations in practice (consider, e.g., two distinct Dirac measures as input), for a merely theoretical 2-approximation, no computation is necessary as mentioned above. On the other hand, the implementation and proofs of the other algorithms in that paper with better results in practice are rather involved.

In view of the hardness of the Wasserstein barycenter problem [14], it is clear that the derived relative error bounds cannot be close to 1 for every set of inputs, unless P = NP. However, the improvement made by one iteration is straightforward to evaluate in the proposed algorithms, such that it can output relative error bounds specific to the given problem without knowing the optimal solution. We observe these resulting improved bounds to be close to 1 in the numerical experiments.

This paper is organized as follows: We introduce the Wasserstein barycenter problem and our notation in Sect. 2. In Sect. 3, we state the algorithms considered in this paper. In Sect. 4, we analyze their worst-case relative error. In Sect. 5, we provide a comparison with other algorithms on a synthetic data set, a numerical exploration of Wasserstein-1 barycenters, and two applications of the discussed framework. Concluding remarks are given in Sect. 6.

2 Wasserstein barycenter problem

In the following, we denote by \(\Vert \cdot \Vert \) the Euclidean norm on \({\mathbb {R}}^d\) and by \({\mathcal {P}}({\mathbb {R}}^d)\) the space of probability measures on \({\mathbb {R}}^d\). Let \(1\le p < \infty \). For two discrete measures

$$\begin{aligned} \mu ^1 = \sum _{k=1}^{n_1} \mu ^1_k \delta (x^1_k), \quad \mu ^2 = \sum _{l=1}^{n_2} \mu ^2_l \delta (x^2_l), \end{aligned}$$

the Wasserstein-p distance is defined by

$$\begin{aligned} {\mathcal {W}}_p^p(\mu ^1, \mu ^2) = \min _{\pi \in \Pi (\mu ^1, \mu ^2)} \langle c_p, \pi \rangle , \end{aligned}$$

where \(\langle c_p, \pi \rangle = {\smash {\int _{{\mathbb {R}}^d\times {\mathbb {R}}^d}}}c_p \,\textrm{d}\pi \) with \(c_p(x,y) {:}{=}\Vert x-y\Vert ^p\) and \(\Pi (\mu ^1, \mu ^2)\) denotes the set of probability measures on \({\mathbb {R}}^d \times {\mathbb {R}}^d\) with marginals \(\mu ^1\) and \(\mu ^2\). The above optimization problem is convex, but can have multiple minimizers \(\pi \).

In this paper, we are given N discrete probability measures \(\mu ^i\in {\mathcal {P}}({\mathbb {R}}^d)\) supported at \(\text {supp}(\mu ^i) = \{ x^i_1, \dots , x^i_{n_i}\}\), where the \(x^i_l\) are pairwise different for every i, i.e.,

$$\begin{aligned} \mu ^i= \sum _{l=1}^{n_i} \mu ^i_l \delta (x^i_l),\quad i=1, \dots , N. \end{aligned}$$

Let \(\Delta _{N} \,{:=}\,\{\lambda \in (0,1)^N: \sum _{i=1}^N\lambda _i=1\}\) denote the open probability simplex. For given weights \(\lambda = (\lambda _1,\ldots ,\lambda _N) \in \Delta _{N}\), we are interested in the computation of Wasserstein barycenters, which are the solutions to the optimization problem

$$\begin{aligned} \min _{\nu \in {\mathcal {P}}({\mathbb {R}}^d)} \Psi _p(\nu ) , \qquad \Psi _p(\nu ) {:}{=}\sum _{i=1}^N\lambda _i{\mathcal {W}}_p^p(\nu , \mu ^i). \end{aligned}$$
(2.1)

The following theorem restates important results from [41, Prop. 3], which connects barycenter problems with what is nowadays known as multi-marginal optimal transport, as well as [29, Prop. 1, Thm. 2] and [42, Thm. 1] in our notation.

Theorem 2.1

The barycenter problem (2.1) has at least one optimal solution \({{\hat{\nu }}}\). Every optimal solution \({{\hat{\nu }}}\) fulfills

$$\begin{aligned} \text {supp}({{\hat{\nu }}}) \subseteq \Big \{ \sum _{i=1}^N\lambda _ix^i : x^i\in \text {supp}(\mu ^i), \, i=1, \dots , N \Big \}. \end{aligned}$$
(2.2)

Moreover, there exists an optimal solution \({{\hat{\nu }}}\), such that

$$\begin{aligned} \#\text {supp}({{\hat{\nu }}}) \le \sum _{i=1}^Nn_i - N + 1. \end{aligned}$$
(2.3)

Proof

Note that (2.2) is straightforward to obtain from the relation to multi-marginal optimal transport [41, Prop. 3]. In the special case \(p=2\), the results from [29], in particular (2.3), can readily be generalized to arbitrary \(\lambda \in \Delta _N\). For general \(p\ge 1\) and barycenter problems with even more general cost functions, this follows from sparsity of multi-marginal optimal transport recently shown in [42, Thm. 1] in combination with [41, Prop. 3]. \(\square \)

In particular, the theorem says that finding optimal Wasserstein barycenters is a discrete optimization problem over the weights of its finite support, which is contained in the convex hull of the supports of the \(\mu ^i\). However, the number of possible support points scales exponentially in N.

3 Algorithms for barycenter approximation

In this section, after motivating the main framework considered in this paper in Sect. 3.1, we discuss two more concrete configurations of it in Sects. 3.2 and 3.3.

3.1 Motivation

At their core, the algorithms in this paper approximate barycenters by “averaging optimal transport plans” from a particular reference measure to the input measures in some sense. This approach is well-known and comes in various flavors in the literature. For example, it can be viewed through the lens of generalized geodesics in Wasserstein spaces [43] and recent literature on linear optimal transport and relatives [44,45,46,47,48]. On the other hand, in [15], one of the first papers on the numerical approximation of Wasserstein barycenters, the idea is presented as a Newton iteration. The same iteration is analyzed in the continuous setting in [49], and it can be used as a characterization of Wasserstein barycenters in terms of fixed points of this procedure, even for uncountably many input measures [50]. See also [51] for this algorithm in the context of weak optimal transport.

Let us define the averaging of transport plans more precisely.

Definition 3.1

Given a discrete measure \(\nu = \sum _{k=1}^{n_\nu }\nu _k \delta (y_k) \in {\mathcal {P}}({\mathbb {R}}^d)\) and transport plans

$$\begin{aligned} \pi ^i {:}{=}\sum _{k=1}^{n_\nu }\sum _{l=1}^{n_i} \pi ^i_{k, l} \delta ( y_k, x_l^i) \in \Pi (\nu , \mu ^i), \quad i=1, \dots , N, \end{aligned}$$

set \(\pi =(\pi ^1, \dots , \pi ^N)\) and let for \(k=1, \dots , n_\nu \), \(p\ge 1\), the barycentric map \(M_{\lambda , \pi }^p:\text {supp}(\nu )\rightarrow {\mathbb {R}}^d\) be defined as

$$\begin{aligned} m_k = M_{\lambda , \pi }^p(y_k) {:}{=}\mathop {{{\,\textrm{argmin}\,}}_{m\in {\mathbb {R}}^d}}\sum _{i=1}^N\lambda _i\sum _{l=1}^{n_i} \frac{\pi ^i_{k, l}}{\nu _k} \Vert m - x_l^i \Vert ^p. \end{aligned}$$

Furthermore, we define the mapping

$$\begin{aligned} G_{\lambda , \pi }^p(\nu ) {:}{=}\sum _{k=1}^{n_\nu } \nu _k \delta (m_k). \end{aligned}$$
(3.1)

That is, each atom \(y_k\) of the measure \(\nu \) is pushed to the weighted barycenter \(m_k\) of its target locations \(x_l^i\), where the weights are given by the \(\lambda _i\) and by the masses that the transport plans \(\pi ^i\) assign to the target locations, relative to the transported mass \(\nu _k\) of the source atom.

Note that for \(p=2\), the map \(M_{\lambda , \pi }^p\) is the classical weighted mean, whereas for \(p=1\), it is called the geometric median. The latter is uniquely defined whenever the points are not collinear, see [52]. Otherwise, in case of ambiguity, the set of minimizers is a one-dimensional line segment, of which we choose the midpoint. However, unlike in the case \(p=2\), there is no explicit formula or exact algorithm involving only arithmetic operations and k-th roots to compute \(M_{\lambda , \pi }^p\), see [53]. Nevertheless, the geometric median can be approximated using Weiszfeld’s algorithm, which essentially consists of the fixed-point iteration

$$\begin{aligned} m^{(k+1)} = \Big ( \sum _{i=1}^N\frac{\lambda _i}{\Vert x_i-m^{(k)}\Vert } \Big )^{-1}\Big ( \sum _{i=1}^N\frac{\lambda _i x_i}{\Vert x_i-m^{(k)}\Vert } \Big ), \end{aligned}$$

with a particular choice of the starting point \(m^{(0)}\) that guarantees \(m^{(k)} \ne x_i\) for all \(i=1, \dots , N\) and \(k\ge 0\). This method is a gradient descent method and accelerated methods are also available. For more details, we refer to the survey [52].
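A minimal Python sketch of this iteration reads as follows (Python is also the language used for the experiments in Sect. 5); the initialization at the weighted mean and the handling of iterates that come close to a data point are simplifications compared to the safeguarded variants discussed in [52], and all names are illustrative.

```python
import numpy as np

def weiszfeld(x, lam, iters=100, tol=1e-9):
    """Approximate the weighted geometric median of the rows of x (N, d) with weights lam (N,)."""
    m = np.average(x, axis=0, weights=lam)      # start at the weighted mean (simplification)
    for _ in range(iters):
        dist = np.linalg.norm(x - m, axis=1)
        if np.any(dist < 1e-12):                # iterate (almost) hit a data point; stop early
            break
        w = lam / dist
        m_new = (w[:, None] * x).sum(axis=0) / w.sum()
        if np.linalg.norm(m_new - m) < tol:
            m = m_new
            break
        m = m_new
    return m
```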

Next, we comment on the relation of \(G_{\lambda , \pi }^p\) to Wasserstein barycenters. In the most important case \(p=2\), formula (3.1) simplifies when the transport plans are non-mass-splitting, that is, for every \(i=1,\dots , N\), each \(\pi ^i\) is supported on the graph of some transport map \(T^i:\text{ supp }(\nu )\rightarrow \text{ supp }(\mu ^i)\) with \(T^i_\# \nu = \mu ^i\). In that case, \(G_{\lambda , \pi }^p\) pushes \(\nu \) forward by the average of the transport maps,

$$\begin{aligned} G_{\lambda , \pi }^p= \Big ( \sum _{i=1}^N\lambda _iT^i \Big )_\#. \end{aligned}$$

This is called McCann interpolation for \(N=2\). In the nondiscrete setting, if \(\nu \) is absolutely continuous, then optimal transport maps \(T^i\) exist by Brenier’s theorem, see e.g. [54, Thm. 1.22]. In fact, [49] discusses the following fixed-point iteration for approximate barycenter computation:

  1. Compute the optimal transport maps \(T^i\) from \(\nu \) to \(\mu ^i\), \(i=1, \dots , N\).

  2. Update \(\nu \leftarrow \Big ( \sum _{i=1}^N\lambda _iT^i \Big )_\# \nu \), and repeat.

It is shown that if there is a unique fixed point, then this is the optimal barycenter and the iteration converges, which is the case for, e.g., Gaussian measures. The convergence is numerically observed to be very fast, and in certain special cases, it is reached already in one iteration. Taking the geometric structure of the Wasserstein space into account, see, e.g., [43], the fixed-point procedure above is the typical algorithm for computing Fréchet means on manifolds [3,4,5].

This motivates the algorithms presented in this paper, which consist in deliberately performing only the first iteration of the fixed-point procedure above. More precisely, the approximate barycenters are of the form \({{\tilde{\nu }}}=G_{\lambda , \pi }^p(\nu )\) for certain plans \(\pi ^i\) and initial measures \(\nu \). We found that this yields the best tradeoff between speed and accuracy in practice, since the error improvement of further iterations is typically rather small.

We illustrate this claim by the following numerical example. We create \(N=10\) discrete measures \(\mu ^i=\sum _{l=1}^n \frac{1}{n} \delta (x_l^i)\), \(i=1, \dots , N\), with \(n=50\) points each, which we sample uniformly from the unit disk and center to have mean zero. We initialize with \(\nu ^{(0)}{:}{=}\mu ^1\) and perform the iteration above until convergence after 5 iterations, that is, \(\nu ^{(6)}=\nu ^{(5)}\). Optimal transport maps \(T^i\) always exist here, since we have empirical measures with the same number of atoms. In Fig. 1, we show the cost \(\Psi _2(\nu ^{(k)})\) with respect to k and compare to the cost \(\Psi _2({{\hat{\nu }}})\) of an optimal barycenter \({{\hat{\nu }}}\), that is, a solution of (2.1). While the error \(\Psi _2(\nu ^{(k)})-\Psi _2({{\hat{\nu }}})\) is decreased in the first step by \(83.2\%\), the improvement in the second iteration is only \(37\%\) of the remaining error and decreases even further until convergence to a suboptimal solution. Moreover, the absolute cost decrease \(\Psi _2(\nu ^{(2)})-\Psi _2(\nu ^{(1)})\) in the second iteration is only \(7.5\%\) of the decrease \(\Psi _2(\nu ^{(1)})-\Psi _2(\nu ^{(0)})\) of the first iteration. This also makes sense intuitively, since it seems reasonable that the largest improvement is gained by pushing every support point from some rather arbitrary initialization to the barycenter of several other reasonably chosen support points of the \(\mu ^i\).
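A sketch of this experiment, using the exact solver ot.emd from the POT package (cf. Sect. 5), could look as follows; the sampling details and the fixed number of iterations are illustrative, and the concrete percentages reported above depend on the drawn samples.

```python
import numpy as np
import ot  # Python Optimal Transport (POT)

rng = np.random.default_rng(0)
N, n, d = 10, 50, 2
lam = np.full(N, 1.0 / N)
b = np.full(n, 1.0 / n)

# sample N empirical measures uniformly from the unit disk and center them
mus = []
for _ in range(N):
    theta = rng.uniform(0.0, 2.0 * np.pi, n)
    r = np.sqrt(rng.uniform(0.0, 1.0, n))
    pts = np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)
    mus.append(pts - pts.mean(axis=0))

def cost(Y):
    # barycenter objective Psi_2 for the measure supported on Y with uniform weights b
    return sum(li * ot.emd2(b, b, ot.dist(Y, X)) for li, X in zip(lam, mus))

Y = mus[0].copy()                                # initialize with mu^1
for k in range(6):
    print(k, cost(Y))
    Y_new = np.zeros_like(Y)
    for li, X in zip(lam, mus):
        P = ot.emd(b, b, ot.dist(Y, X))          # optimal plan from the current iterate to mu^i
        Y_new += li * (P @ X) / b[:, None]       # push each atom to the mean of its targets
    Y = Y_new
```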

Fig. 1

Barycenter cost \(\Psi _2(\nu ^{(k)})\) over the number of iterations k in blue. The black dashed line depicts the optimal cost \( \Psi _{2} (\hat{\nu })\)

Furthermore, from a theoretical standpoint, there are simple examples with convergence after one iteration for both presented algorithms below, such that we cannot expect in general to gain any improvements using more than one iteration either. In particular, as in the numerical example above, there is no way to guarantee convergence to the optimum of this iterative procedure in general, which is the case for any algorithm due to the NP-hardness of the problem [14].

3.2 Reference algorithm

In this section, we choose \(\nu =\mu ^j\) as initialization. For simplicity of notation, we reorder the measures such that \(j=1\). That is, we compute \(N-1\) optimal transport plans

$$\begin{aligned} \pi ^i = \sum _{k=1}^{n_1} \sum _{l=1}^{n_i} \pi ^i_{k, l} \delta ( x^1_k, x_l^i) \in {{\,\textrm{argmin}\,}}_{\pi \in \Pi (\mu ^1, \mu ^i)} \langle c_p, \pi \rangle , \quad i=2, \dots , N \end{aligned}$$

and consider the approximate barycenter defined by

$$\begin{aligned} {{\tilde{\nu }}} = \sum _{k=1}^{n_1} \mu ^1_k \delta (M_{\lambda , \pi }^p(x^1_k)). \end{aligned}$$
(3.2)

Note that the support of \({{\tilde{\nu }}}\) given by (3.2) is very sparse, since it contains only \(n_1\) elements, which is an interesting feature from a computational point of view.

For \(p=2\), if the input measures are given in terms of matrices \(X^i\in {\mathbb {R}}^{n_i\times d}\), where the rows are the support points, and the corresponding mass weights are the vectors \(\mu ^i\in {\mathbb {R}}^{n_i}\) for all \(i=1, \dots , N\), then computing the support matrix \(Y\in {\mathbb {R}}^{n_1\times d}\) of (3.2) can be written as an average of N matrix products as outlined in Algorithm 1.

Algorithm 1 (reference algorithm, \(p=2\))
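In Python, this step can be sketched as follows, using ot.emd from the POT package (cf. Sect. 5) as the two-marginal solver; the list-of-arrays input format and all names are illustrative, and the plan from \(\mu ^1\) to itself is taken as the identity coupling, so that the reference atoms contribute themselves with weight \(\lambda _1\).

```python
import numpy as np
import ot

def reference_barycenter_p2(X, mu, lam):
    """Reference algorithm for p = 2: X[i] is the (n_i, d) support matrix of mu^i,
    mu[i] its weight vector, lam the barycenter weights. Returns the support matrix Y
    of the approximate barycenter; its weights are mu[0]."""
    Y = lam[0] * X[0]                                    # identity coupling of mu^1 with itself
    for i in range(1, len(X)):
        P = ot.emd(mu[0], mu[i], ot.dist(X[0], X[i]))    # optimal plan from mu^1 to mu^i (c_2 costs)
        Y += lam[i] * (P @ X[i]) / mu[0][:, None]        # barycentric projection onto the atoms of mu^1
    return Y, mu[0]
```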

In the case \(p=1\), since there is no closed form for \(M_{\lambda , \pi }^p(x_k^1)\), we have to modify the algorithm slightly.

Remark 3.2

Let \(f(m) {:}{=}\sum _{i=1}^N\lambda _i\Vert x_i - m\Vert \) and \({{\hat{m}}} {:}{=}{{\,\textrm{argmin}\,}}_{m\in {\mathbb {R}}^d} f(m)\). We show that Weiszfeld’s algorithm is guaranteed to approximate \(f({{\hat{m}}})\) up to a multiplicative factor of \((1+\varepsilon )\) after an explicitly computable number of iterations. In [52, Thm. 8.2] it is shown for the Weiszfeld iterates \(m^{(k)}\) that

$$\begin{aligned} f(m^{(k)})-f({{\hat{m}}}) \le \frac{M}{k}\Vert m^{(0)}-{{\hat{m}}}\Vert ^2, \end{aligned}$$

with an explicit formula for M, depending only on the \(x_i\) and \(m^{(0)}\). Since f is convex, from \(\nabla f = 0\), a simple calculation shows that \({{\hat{m}}}\) must lie in the convex hull of the \(x_i\). Thus

$$\begin{aligned} \Vert m^{(0)}-{{\hat{m}}}\Vert ^2 \le \max _{i=1, \dots , N} \Vert m^{(0)} - x_i\Vert ^2. \end{aligned}$$

Moreover, by (4.3), it holds

$$\begin{aligned} \sum _{i<j}^N \lambda _i\lambda _j\Vert x_i-x_j\Vert \le \sum _{i=1}^N\lambda _i\Vert x_i - {{\hat{m}}}\Vert = f({{\hat{m}}}), \end{aligned}$$

such that for any given \(\varepsilon > 0\), choosing

$$\begin{aligned} k\ge \frac{M\cdot \max _{i=1, \dots , N} \Vert m^{(0)} - x_i\Vert ^2}{\varepsilon \sum _{i<j}^N \lambda _i\lambda _j\Vert x_i-x_j\Vert } \end{aligned}$$

guarantees that

$$\begin{aligned} f(m^{(k)})&= (f(m^{(k)}) - f({{\hat{m}}})) + f({{\hat{m}}}) \le \frac{M}{k} \Vert m^{(0)}-{{\hat{m}}}\Vert ^2 + f({{\hat{m}}}) \\&\le \frac{M\varepsilon \sum _{i<j}^N \lambda _i\lambda _j\Vert x_i-x_j\Vert }{M\cdot \max _{i=1, \dots , N} \Vert m^{(0)} - x_i\Vert ^2}\Vert m^{(0)}-{{\hat{m}}}\Vert ^2 + f({{\hat{m}}}) \le (1+\varepsilon )f({{\hat{m}}}). \end{aligned}$$

This prepares us to state the reference algorithm for the case \(p=1\), see Algorithm 2.

Algorithm 2 (reference algorithm, \(p=1\))
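For \(p=1\), a corresponding sketch replaces the mean by the geometric median, computed with the weiszfeld routine sketched in Sect. 3.1; for simplicity, a fixed iteration budget is used instead of the bound from Remark 3.2, and all names are illustrative.

```python
import numpy as np
import ot

def reference_barycenter_p1(X, mu, lam, iters=100):
    """Reference algorithm for p = 1: push each atom of mu^1 to the (approximate) geometric
    median of its targets under optimal plans with Euclidean costs."""
    N = len(X)
    plans = [ot.emd(mu[0], mu[i], ot.dist(X[0], X[i], metric='euclidean')) for i in range(1, N)]
    Y = np.zeros_like(X[0])
    for k in range(len(mu[0])):
        pts, w = [X[0][[k]]], [np.array([lam[0]])]       # the atom itself with weight lam[0]
        for i in range(1, N):
            row = plans[i - 1][k]
            idx = row > 0
            pts.append(X[i][idx])
            w.append(lam[i] * row[idx] / mu[0][k])
        # weiszfeld: the routine sketched in Sect. 3.1
        Y[k] = weiszfeld(np.vstack(pts), np.concatenate(w), iters=iters)
    return Y, mu[0]
```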

3.3 Pairwise algorithm

We will see that in order to achieve better results than the reference algorithm, it is beneficial to “average out” the asymmetry introduced by choosing \(\mu ^1\) as the reference measure in (3.2). Therefore, we choose

$$\begin{aligned} \nu = \sum _{i=1}^N\lambda _i\mu ^i \end{aligned}$$
(3.3)

as initial measure in this section. However, instead of computing optimal plans from \(\nu \) to each \(\mu ^i\), we solve

$$\begin{aligned} \pi ^{ij}\in \mathop {{{\,\textrm{argmin}\,}}_{\pi \in \Pi (\mu ^i, \mu ^j)}} \langle c_p, \pi \rangle \end{aligned}$$

pairwise for every \(1\le i<j\le N\) and use the transport plans

$$\begin{aligned} \pi ^i = \sum _{j=1}^N\lambda _j \pi ^{ji} \in \Pi (\nu , \mu ^i), \end{aligned}$$
(3.4)

in (3.1), so that our approximate barycenter \({{\tilde{\nu }}}\) with (3.3) and (3.4) reads as

$$\begin{aligned} {{\tilde{\nu }}} = G_{\lambda , \pi }^p(\nu ) = \sum _{i=1}^N\lambda _i\sum _{k=1}^{n_i} \mu ^i_k \delta ( M_{\lambda , (\pi ^{i1}, \dots , \pi ^{iN})}^p(x^i_k)). \end{aligned}$$
(3.5)

Splitting up the OT computations like this scales better in terms of computational complexity and seems to yield better numerical results in practice. Clearly, we will have

$$\begin{aligned} \#\text {supp}({{\tilde{\nu }}}) \le n_1+\dots +n_N, \end{aligned}$$

that is, \({{\tilde{\nu }}}\) meets practically the same sparsity bound as an optimal solution \({{\hat{\nu }}}\), see (2.3).

Remark 3.3

Note that the inner sum in (3.5) is of the form (3.2). If we denote by \({{\tilde{\nu }}}^i\) the barycenter obtained from the reference algorithm, when \(\mu ^i\) was the reference measure, i.e., permuted to the first position, our approximation (3.5) is simply

$$\begin{aligned} {{\tilde{\nu }}} = \sum _{i=1}^N\lambda _i{{\tilde{\nu }}}^i. \end{aligned}$$

However, since we can choose \(\pi ^{ji}=(\pi ^{ij})^\textrm{T}\), we save half of the necessary OT computations compared to executing the reference algorithm N times.

Algorithm 3 summarizes this approach for \(p=2\) using matrix-vector notation, where \(\odot \) denotes element-wise multiplication and \(\mathbb {1}_d\) denotes a d-dimensional vector of ones. Note that \(\eta \) denotes an upper bound of the relative error for the particular given problem, i.e., it holds that \(\Psi _2({{\tilde{\nu }}})/\Psi _2({{\hat{\nu }}})\le \eta \). This is proven in Sect. 4.

Algorithm 3 (pairwise algorithm, \(p=2\))
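The corresponding computation can be sketched in Python as follows (illustrative names, ot.emd as two-marginal solver); the sketch also evaluates the a posteriori bound \(\eta \) from (4.6) derived in Sect. 4.

```python
import numpy as np
import ot

def pairwise_barycenter_p2(X, mu, lam):
    """Pairwise algorithm for p = 2. Returns support Y, weights b and the error bound eta of (4.6)."""
    N = len(X)
    P = [[None] * N for _ in range(N)]
    W2 = np.zeros((N, N))
    for i in range(N):
        for j in range(i + 1, N):
            C = ot.dist(X[i], X[j])                       # squared Euclidean costs
            P[i][j] = ot.emd(mu[i], mu[j], C)
            P[j][i] = P[i][j].T                           # reuse transposed plans (Remark 3.3)
            W2[i, j] = np.sum(P[i][j] * C)                # W_2^2(mu^i, mu^j)
    denom = sum(lam[i] * lam[j] * W2[i, j] for i in range(N) for j in range(i + 1, N))

    Y_parts, b_parts, shift = [], [], 0.0
    for i in range(N):
        Yi = lam[i] * X[i]                                # reference step with mu^i as reference
        for j in range(N):
            if j != i:
                Yi += lam[j] * (P[i][j] @ X[j]) / mu[i][:, None]
        Y_parts.append(Yi)
        b_parts.append(lam[i] * mu[i])
        shift += lam[i] * np.sum(mu[i] * np.sum((Yi - X[i]) ** 2, axis=1))
    eta = 2.0 - shift / denom                             # bound (4.6)
    return np.vstack(Y_parts), np.concatenate(b_parts), eta
```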

Again, these matrix-vector computations will not work in the case \(p=1\). Instead, Algorithm 4 outlines the computation of (3.5) using Weiszfeld’s algorithm.

Algorithm 4 (pairwise algorithm, \(p=1\))
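For \(p=1\), a sketch can combine the \(p=1\) reference routine from Sect. 3.2 over all choices of the reference measure, as described in Remark 3.3; for brevity, the transposed plans are recomputed here instead of being reused, and all names are illustrative.

```python
import numpy as np

def pairwise_barycenter_p1(X, mu, lam):
    """Pairwise algorithm for p = 1 via Remark 3.3: lam-weighted combination of the p = 1 reference
    results (reference_barycenter_p1 as sketched in Sect. 3.2) for every choice of reference."""
    Y_parts, b_parts = [], []
    for i in range(len(X)):
        order = [i] + [j for j in range(len(X)) if j != i]        # put mu^i first
        Yi, _ = reference_barycenter_p1([X[j] for j in order],
                                        [mu[j] for j in order],
                                        [lam[j] for j in order])
        Y_parts.append(Yi)
        b_parts.append(lam[i] * mu[i])
    return np.vstack(Y_parts), np.concatenate(b_parts)
```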

4 Analysis

In this section, we give worst case bounds for the relative error \(\Psi _p({{\tilde{\nu }}})/\Psi _p({{\hat{\nu }}})\), where \({{\tilde{\nu }}}\) is an approximate barycenter computed by one of the algorithms above, \({{\hat{\nu }}}\) is an optimal barycenter, and \(\Psi _p\) is the objective defined in (2.1). In the proofs, we will use the following basic identities.

Lemma 4.1

For any points \(x_1, \dots , x_N, y\in {\mathbb {R}}^d\), \(\lambda \in \Delta _N\) and \(m{:}{=}\sum _{i=1}^N\lambda _ix_i\), we have the following identities:

$$\begin{aligned} \sum _{i=1}^N \lambda _i \Vert x_i - y\Vert ^2&= \Vert m-y\Vert ^2 + \sum _{i=1}^N \lambda _i \Vert x_i - m\Vert ^2, \end{aligned}$$
(4.1)
$$\begin{aligned} \sum _{i=1}^N\lambda _i\Vert x_i-m\Vert ^2&= \sum _{i<j}^N \lambda _i\lambda _j \Vert x_i - x_j\Vert ^2, \end{aligned}$$
(4.2)
$$\begin{aligned} \sum _{i=1}^N\lambda _i\Vert x_i-y\Vert&\ge \sum _{i<j}^N \lambda _i\lambda _j \Vert x_i - x_j\Vert . \end{aligned}$$
(4.3)

Proof

For (4.1), we set \(z{:}{=}m-y\) to obtain

$$\begin{aligned} \sum _{i=1}^N \lambda _i \Vert x_i - y\Vert ^2&= \sum _{i=1}^N \lambda _i \Vert x_i -m+z\Vert _2^2 = \sum _{i=1}^N \lambda _i \left( \Vert z\Vert ^2 + \Vert x_i-m\Vert ^2 - 2\langle x_i-m, z\rangle \right) \\&= \Vert m-y\Vert ^2 + \sum _{i=1}^N \lambda _i \Vert x_i-m\Vert ^2. \end{aligned}$$

For (4.2), plugging \(y=x_j\) into (4.1), we get

$$\begin{aligned} \sum _{i=1}^N\lambda _i\Vert x_i-x_j\Vert ^2 = \Vert m-x_j\Vert ^2 + \sum _{i=1}^N\lambda _i\Vert x_i-m\Vert ^2. \end{aligned}$$

Weighting this equality with \(\lambda _j\) and summing over \(j=1, \dots , N\), we get

$$\begin{aligned} \sum _{i,j=1}^N\lambda _i \lambda _j \Vert x_i-x_j\Vert ^2&= \sum _{j=1}^N \lambda _j\Vert x_j-m\Vert ^2 + \sum _{j=1}^N\lambda _j \sum _{i=1}^N\lambda _i\Vert x_i-m\Vert ^2 \\&= 2\sum _{i=1}^N\lambda _i\Vert x_i-m\Vert ^2. \end{aligned}$$

Dividing by 2 yields (4.2). For (4.3), note that by the triangle inequality,

$$\begin{aligned} \sum _{i<j}^N \lambda _i\lambda _j \Vert x_i - x_j\Vert&= \frac{1}{2} \sum _{i=1}^N\lambda _i\sum _{j=1}^N\lambda _j \Vert x_i - x_j\Vert \\&\le \frac{1}{2} \sum _{i=1}^N\lambda _i\sum _{j=1}^N\lambda _j (\Vert x_i-y\Vert + \Vert y- x_j\Vert ) \\&= \frac{1}{2} \Big (\sum _{i=1}^N\lambda _i\Vert x_i-y\Vert + \sum _{j=1}^N\lambda _j \Vert y- x_j\Vert \Big ) = \sum _{i=1}^N\lambda _i\Vert x_i-y\Vert . \end{aligned}$$

\(\square \)

In order to upper bound \(\Psi _p({{\tilde{\nu }}})/\Psi _p({{\hat{\nu }}})\), we require a lower bound on \(\Psi _p({{\hat{\nu }}})\).

Proposition 4.2

For any discrete \(\nu \in {\mathcal {P}}({\mathbb {R}}^d)\) and \(p=1, 2\), it holds that

$$\begin{aligned} \Psi _p(\nu ) \ge \sum _{i<j}^N \lambda _i\lambda _j {\mathcal {W}}_p^p(\mu ^i, \mu ^j). \end{aligned}$$
(4.4)

Proof

Let \(p\in \{1, 2\}\) and \(\nu =\sum _{k=1}^{n_\nu } \nu _k \delta (y_k)\) be arbitrary. Take \(\pi ^i \in {{\,\textrm{argmin}\,}}_{\pi \in \Pi (\nu , \mu ^i)} \langle c_p, \pi \rangle \), then by definition,

$$\begin{aligned} \Psi _p(\nu ) = \sum _{i=1}^N\lambda _i{\mathcal {W}}_p^p(\nu , \mu ^i) = \sum _{i=1}^N\lambda _i\sum _{k=1}^{n_\nu }\sum _{l_i=1}^{n_i} \pi ^i_{k, l_i} \Vert y_k-x^i_{l_i}\Vert ^p. \end{aligned}$$

Since it holds for any \(i=1, \dots , N\) and \(k=1, \dots , n_\nu \) that

$$\begin{aligned} \sum _{l_1, \dots , l_{i-1}, l_{i+1}, \dots , l_N} \frac{\pi ^1_{k, l_1}\dots \pi ^{i-1}_{k, l_{i-1}}\pi ^{i+1}_{k, l_{i+1}}\dots \pi ^N_{k, l_N}}{\nu ^{N-1}_k} = 1, \end{aligned}$$

we get

$$\begin{aligned} \Psi _p(\nu )&= \sum _{i=1}^N\lambda _i\sum _{k=1}^{n_\nu } \sum _{l_1, \dots , l_N} \frac{\pi ^1_{k, l_1}\dots \pi ^N_{k, l_N}}{\nu ^{N-1}_k} \Vert y_k- x^i_{l_i}\Vert ^p \\&= \sum _{k=1}^{n_\nu } \sum _{l_1, \dots , l_N} \!\!\!\frac{\pi ^1_{k, l_1}\dots \pi ^N_{k, l_N}}{\nu ^{N-1}_k} \sum _{i=1}^N\lambda _i\Vert y_k- x^i_{l_i}\Vert ^p \end{aligned}$$

and by (4.2) and (4.3), this yields

$$\begin{aligned} \Psi _p(\nu )&\ge \sum _{k=1}^{n_\nu } \sum _{l_1, \dots , l_N} \!\!\!\!\frac{\pi ^1_{k, l_1}\dots \pi ^N_{k, l_N}}{\nu ^{N-1}_k} \sum _{i<j}^N \lambda _i\lambda _j \Vert x^i_{l_i} - x^j_{l_j}\Vert ^p \\&= \sum _{i<j}^N \lambda _i\lambda _j \sum _{k=1}^{n_\nu }\sum _{l_i, l_j} \frac{\pi ^i_{k, l_i}\pi ^j_{k, l_j}}{\nu _k} \Vert x^i_{l_i} - x^j_{l_j}\Vert ^p. \end{aligned}$$

It is straightforward to check that

$$\begin{aligned} \sum _{k=1}^{n_\nu }\sum _{l_i, l_j} \frac{\pi ^i_{k, l_i}\pi ^j_{k, l_j}}{\nu _k} \delta (x^i_{l_i}, x^j_{l_j}) \in \Pi (\mu ^i, \mu ^j), \end{aligned}$$

and so we get

$$\begin{aligned} \Psi _p(\nu ) \ge \sum _{i<j}^N \lambda _i\lambda _j {\mathcal {W}}_p^p(\mu ^i, \mu ^j). \end{aligned}$$

\(\square \)

Equipped with (4.4), we can see that already the simple choices \(\nu =\mu ^j\) and \(\nu =\sum _{i=1}^N\lambda _i\mu ^i\) for the initial measure approximate the optimal barycenter to some extent.

Proposition 4.3

Let \(p\in \{1, 2\}\) and \({{\hat{\nu }}}\) be an optimal barycenter in (2.1).

  (i)

    For \(\nu {:}{=}\mu ^j\), it holds that

    $$\begin{aligned} \frac{\Psi _p(\nu )}{\Psi _p({{\hat{\nu }}})} \le \frac{1}{\lambda _j}. \end{aligned}$$

    Note that in particular, if \(j\in {{\,\textrm{argmax}\,}}_{i=1}^N \lambda _i\), then \(\Psi _p(\nu )/\Psi _p({{\hat{\nu }}})\le N\).

  (ii)

    Let \(\nu {:}{=}\sum _{i=1}^N\lambda _i\mu ^i\), then

    $$\begin{aligned} \frac{\Psi _p(\nu )}{\Psi _p({{\hat{\nu }}})} \le 2. \end{aligned}$$
  (iii)

    If \(\nu \) is chosen randomly as one of the \(\mu ^i\) with probabilities \(\lambda _i\), then also

    $$\begin{aligned} \frac{{\mathbb {E}}[\Psi _p(\nu )]}{\Psi _p({{\hat{\nu }}})} \le 2. \end{aligned}$$

Proof

  (i)

    Let \(\nu {:}{=}\mu ^j\), then we see that

    $$\begin{aligned} \Psi _p(\nu ) = \Psi _p(\mu ^j) = \sum _{i=1}^N\lambda _i{\mathcal {W}}_p^p(\mu ^i, \mu ^j). \end{aligned}$$

    By (4.4),

    $$\begin{aligned} \Psi _p({{\hat{\nu }}}) \ge \sum _{i<j}^N\lambda _i\lambda _j {\mathcal {W}}_p^p(\mu ^i, \mu ^j) \ge \lambda _j \sum _{i=1}^N\lambda _i{\mathcal {W}}_p^p(\mu ^i, \mu ^j), \end{aligned}$$

    such that

    $$\begin{aligned} \frac{\Psi _p(\nu )}{\Psi _p({{\hat{\nu }}})} \le \frac{1}{\lambda _j}. \end{aligned}$$
  (ii)

    For the choice \(\nu {:}{=}\sum _{i=1}^N\lambda _i\mu ^i\), taking \(\pi ^{ij}\in {{\,\textrm{argmin}\,}}_{\pi \in \Pi (\mu ^i, \mu ^j)} \langle c_p, \pi \rangle \), we note that

    $$\begin{aligned} \sum _{j=1}^N\lambda _j \pi ^{ji} \in \Pi (\nu , \mu ^i). \end{aligned}$$

    Hence,

    $$\begin{aligned} \Psi _p(\nu )&= \sum _{i=1}^N\lambda _i{\mathcal {W}}_p^p\Big (\sum _{j=1}^N\lambda _j\mu ^j, \mu ^i\Big ) \le \sum _{i=1}^N\lambda _i \langle c_p, \sum _{j=1}^N\lambda _j \pi ^{ji}\rangle \\&= 2\sum _{i<j}^N\lambda _i\lambda _j{\mathcal {W}}_p^p(\mu ^i, \mu ^j), \end{aligned}$$

    such that

    $$\begin{aligned} \frac{\Psi _p(\nu )}{\Psi _p({{\hat{\nu }}})} \le 2. \end{aligned}$$
  (iii)

    This follows analogously to (ii), using linearity of expectation.

\(\square \)

In general, there is no polynomial-time algorithm that will achieve an error arbitrarily close to 1 with high probability, see [14]. In light of this result, it is interesting to see that it is possible to obtain a relative error bound of 2 as in [33], but without performing any computations. However, note that merely using a mixture of the inputs yields rather useless barycenter approximations in practice; consider, e.g., two distinct Dirac measures.

Although we will see that the bounds above are still more or less sharp for Algorithms 1–4, these algorithms perform a lot better in practice. Moreover, these bounds are typically drastically improved as soon as a specific problem is given, see Remark 4.6 and Sect. 5.

Using one of the mentioned trivial choices as initial measures, all algorithms above aim to improve the approximation quality using the mapping \(G_{\lambda , \pi }^p\). Next, we show that given any approximate barycenter \(\nu \), executing \(G_{\lambda , \pi }^p\) on \(\nu \) never makes the approximation worse, if we choose the OT plans \(\pi ^i \in \Pi (\nu , \mu ^i)\) to be optimal.

Proposition 4.4

Given a discrete \(\nu =\sum _{k=1}^{n_\nu }\nu _k\delta (y_k)\in \mathcal P({\mathbb {R}}^d)\), \(p\ge 1\), let

$$\begin{aligned} \pi ^i {:}{=}\sum _{k=1}^{n_\nu }\sum _{l=1}^{n_i} \pi ^i_{k, l} \delta (y_k, {x^i_l}) \in {{\,\text {argmin}\,}}_{\pi \in \Pi ({\nu , \mu ^i})} \langle c_p, \pi \rangle \end{aligned}$$

be optimal transport plans. Then it holds

$$\begin{aligned} \Psi _p(G_{\lambda , \pi }^p(\nu )) \le \Psi _p(\nu ). \end{aligned}$$

Proof

By definition of \(\pi ^i\), we have for all \(i=1, \dots , N\) that

$$\begin{aligned} {\mathcal {W}}_p^p(\nu , \mu ^i) = \langle c_p, \pi ^i\rangle = \sum _{k=1}^{n_\nu }\sum _{l=1}^{n_i}\pi ^i_{k, l}\Vert y_k - x_l^i \Vert ^p. \end{aligned}$$

Set

$$\begin{aligned} {{\tilde{\pi }}}^i {:}{=}\sum _{k=1}^{n_\nu }\sum _{l=1}^{n_i} \pi ^i_{k, l} \delta (m_k, x_l^i) \in \Pi ({G_{\lambda , \pi }^p(\nu ), \mu ^i}), \end{aligned}$$

where \(m_k = M_{\lambda , \pi }^p(y_k)\). Then it holds that

$$\begin{aligned} \Psi _p(G_{\lambda , \pi }^p(\nu ))&= \sum _{i=1}^N\lambda _i{\mathcal {W}}_p^p(G_{\lambda , \pi }^p(\nu ), \mu ^i) \le \sum _{i=1}^N\lambda _i\langle c_p, {{{\tilde{\pi }}}^i\rangle } \\&= \sum _{i=1}^N\lambda _i\sum _{k=1}^{n_\nu }\sum _{l=1}^{n_i}\pi ^i_{k, l}\Vert m_k-x_l^i \Vert ^p = \sum _{k=1}^{n_\nu } \nu _k \sum _{i=1}^N\lambda _i\sum _{l=1}^{n_i}\frac{\pi ^i_{k, l}}{\nu _k} \Vert m_k-x_l^i\Vert ^p \\&= \sum _{k=1}^{n_\nu } \nu _k \min _{m\in {\mathbb {R}}^d}\sum _{i=1}^N\lambda _i\sum _{l=1}^{n_i}\frac{\pi ^i_{k, l}}{\nu _k} \Vert m-x_l^i\Vert ^p \\&\le \sum _{k=1}^{n_\nu }\nu _k\sum _{i=1}^N\lambda _i\sum _{l=1}^{n_i}\frac{\pi ^i_{k, l}}{\nu _k} \Vert y_k-x_l^i\Vert ^p = \sum _{i=1}^N\lambda _i\sum _{k=1}^{n_\nu }\sum _{l=1}^{n_i}\pi ^i_{k, l} \Vert y_k-x_l^i\Vert ^p \\&= \sum _{i=1}^N\lambda _i{\mathcal {W}}_p^p({\nu , \mu ^i}) = \Psi _p(\nu ). \end{aligned}$$

\(\square \)

Combining the results above, we immediately get the following error bounds for the algorithms introduced in Sect. 3.

Corollary 4.5

Let \(p\in \{1, 2 \}\) and let \({{\hat{\nu }}}\) be an optimal barycenter.

  (i)

    If \({{\tilde{\nu }}}\) is obtained by Algorithm 1 (case \(p=2\)) or Algorithm 2 (case \(p=1\)), then it holds that

    $$\begin{aligned} \frac{\Psi _p({{\tilde{\nu }}})}{\Psi _p({{\hat{\nu }}})} \le \frac{1}{\lambda _1} \quad \text {or} \quad \frac{\Psi _p({{\tilde{\nu }}})}{\Psi _p({{\hat{\nu }}})} \le \frac{1+\varepsilon }{\lambda _1}, \end{aligned}$$

    respectively. Moreover, if instead the reference measure is chosen randomly with probabilities equal to the corresponding \(\lambda _i\), then

    $$\begin{aligned} \frac{{\mathbb {E}}[\Psi _p({{\tilde{\nu }}})]}{\Psi _p({{\hat{\nu }}})} \le 2 \quad \text {or}\quad \frac{{\mathbb {E}}[\Psi _p({{\tilde{\nu }}})]}{\Psi _p({{\hat{\nu }}})} \le 2(1+\varepsilon ). \end{aligned}$$
  (ii)

    If \({{\tilde{\nu }}}\) is obtained by Algorithm 3 (case \(p=2\)) or Algorithm 4 (case \(p=1\)), then it holds that

    $$\begin{aligned} \frac{\Psi _p({{\tilde{\nu }}})}{\Psi _p({{\hat{\nu }}})} \le 2 \quad \text {or}\quad \frac{\Psi _p({{\tilde{\nu }}})}{\Psi _p({{\hat{\nu }}})} \le 2(1+\varepsilon ), \end{aligned}$$

    respectively.

Proof

This follows immediately by combining Propositions 4.3 and 4.4, and the fact that

$$\begin{aligned} \sum _{i=1}^N\lambda _i\sum _{l=1}^{n_i}\frac{\pi ^i_{k, l}}{\nu _k}\Vert m - x_l^i\Vert \end{aligned}$$

is only optimized by \(m_k\) up to a factor \((1+\varepsilon )\) for every \(k=1, \dots , n_\nu \) in the case \(p=1\). \(\square \)

Remark 4.6

Next, we show how to improve on the 2-approximation bound for a specific given problem. We assume that we are given optimal or close to optimal transport plans

$$\begin{aligned} \pi ^i = \sum _{k=1}^{n_\nu }\sum _{l=1}^{n_i} \pi ^i_{k, l} \delta ( y_k, x_l^i) \in \Pi (\nu , \mu ^i), \quad i=1, \dots , N. \end{aligned}$$

In case of the pairwise algorithm (Algorithms 3 and 4), we use

$$\begin{aligned} \pi ^i = \sum _{j=1}^N\lambda _j \pi ^{ji} \in \Pi (\nu , \mu ^i), \quad \text {where} \quad \pi ^{ji}\in {{\,\textrm{argmin}\,}}_{\pi \in \Pi (\mu ^j, \mu ^i)} \langle c_p, \pi \rangle . \end{aligned}$$

Given our approximate barycenter

$$\begin{aligned} {{\tilde{\nu }}} = \sum _{k=1}^{n_\nu } \nu _k \delta (m_k), \quad m_k = M_{\lambda , \pi }^p(y_k), \end{aligned}$$

consider again

$$\begin{aligned} {{\tilde{\pi }}}^i {:}{=}\sum _{k=1}^{n_\nu }\sum _{l=1}^{n_i} \pi ^i_{k, l} \delta ( m_k, x_l^i) \in \Pi ({{\tilde{\nu }}}, \mu ^i). \end{aligned}$$

Then

$$\begin{aligned} \Psi _p({{\tilde{\nu }}}) = \sum _{i=1}^N\lambda _i{\mathcal {W}}_p^p({{\tilde{\nu }}}, \mu ^i) \le \sum _{i=1}^N\lambda _i\langle c_p, {{\tilde{\pi }}}^i\rangle = \sum _{i=1}^N\lambda _i\sum _{k=1}^{n_\nu }\sum _{l=1}^{n_i} \pi ^i_{k, l} \Vert m_k- x_l^i\Vert ^p. \end{aligned}$$

Together with (4.4), this gives

$$\begin{aligned} \frac{\Psi _p({{\tilde{\nu }}})}{\Psi _p({{\hat{\nu }}})} \le \frac{\sum _{i=1}^N\lambda _i\sum _{k=1}^{n_\nu }\sum _{l=1}^{n_i} \pi ^i_{k, l} \Vert m_k- x_l^i\Vert ^p}{\sum _{i<j}^N \lambda _i\lambda _j {\mathcal {W}}_p^p(\mu ^i, \mu ^j)}. \end{aligned}$$
(4.5)

In the case \(p=2\), since

$$\begin{aligned} m_k = M_{\lambda , \pi }^p(y_k) = \sum _{i=1}^N\lambda _i\sum _{l=1}^{n_i} \frac{\pi ^i_{k, l}}{\nu _k} x^i_l \quad \text {with} \quad \sum _{i=1}^N\lambda _i\sum _{l=1}^{n_i} \frac{\pi ^i_{k, l}}{\nu _k} = 1, \end{aligned}$$

by incorporating (4.1), the numerator in (4.5) simplifies to

$$\begin{aligned} \Psi _2({{\tilde{\nu }}})&\le \sum _{k=1}^{n_\nu } \nu _k \sum _{i=1}^N\lambda _i\sum _{l=1}^{n_i}\frac{\pi ^i_{k, l}}{\nu _k} \Vert m_k-x_l^i\Vert ^2 \\&= \sum _{k=1}^{n_\nu } \nu _k \Big ( \sum _{i=1}^N\lambda _i\sum _{l=1}^{n_i}\frac{\pi ^i_{k, l}}{\nu _k} \Vert y_k-x_l^i\Vert ^2 - \Vert m_k-y_k\Vert ^2\Big ) \\&= \sum _{i=1}^N\lambda _i\sum _{k=1}^{n_\nu }\sum _{l=1}^{n_i}\pi ^i_{k, l}\Vert y_k-x_l^i\Vert ^2 -\sum _{k=1}^{n_\nu } \nu _k\Vert m_k-y_k\Vert ^2 \\&= \Psi _2(\nu ) -\sum _{k=1}^{n_\nu } \nu _k\Vert m_k-y_k\Vert ^2, \end{aligned}$$

such that by Proposition 4.3 (ii), we get

$$\begin{aligned} \frac{\Psi _2({{\tilde{\nu }}})}{\Psi _2({{\hat{\nu }}})} \le 2 - \frac{\sum _{k=1}^{n_\nu } \nu _k\Vert m_k-y_k\Vert ^2}{\sum _{i<j}^N\lambda _i\lambda _j {\mathcal {W}}_2^2(\mu ^i, \mu ^j)}. \end{aligned}$$
(4.6)

Either way, for both \(p=1, 2\), the right-hand sides of (4.5) and (4.6) can be evaluated with almost no computational overhead after the execution of Algorithms 3 and 4, since the optimal transport plans \(\pi ^{ij}\) between \(\mu ^i\) and \(\mu ^j\) have already been computed. This usually gives bounds much closer to one than the worst-case guarantees in Corollary 4.5.
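For concreteness, the right-hand side of (4.5) for the pairwise plans can be evaluated as sketched below, where \(\pi ^{ii}\) is taken as the diagonal coupling of \(\mu ^i\) with itself; M, P and Wpp denote the pushed support points, the pairwise plans and the pairwise costs \({\mathcal {W}}_p^p(\mu ^i, \mu ^j)\), and all names are illustrative.

```python
import numpy as np

def error_bound(M, X, mu, lam, P, Wpp, p):
    """Evaluate the right-hand side of (4.5) for the pairwise algorithm: M[i] are the pushed
    support points of mu^i, P[j][i] the plan between mu^j and mu^i, Wpp[i, j] = W_p^p(mu^i, mu^j)."""
    N = len(X)
    num = 0.0
    for i in range(N):
        for j in range(N):
            Pji = np.diag(mu[j]) if j == i else P[j][i]            # pi^{ii} is the diagonal coupling
            D = np.linalg.norm(M[j][:, None, :] - X[i][None, :, :], axis=2) ** p
            num += lam[i] * lam[j] * np.sum(Pji * D)
    denom = sum(lam[i] * lam[j] * Wpp[i, j] for i in range(N) for j in range(i + 1, N))
    return num / denom
```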

Finally, we discuss the sharpness of the bounds in Corollary 4.5.

Proposition 4.7

Let \(N\ge 2\) and consider the case with \(\lambda =(\frac{1}{N}, \dots , \frac{1}{N})\in \Delta _N\). There exist measures \(\mu ^1, \mu ^2 = \mu ^3 = \dots = \mu ^N\), such that if \({{\hat{\nu }}}\) is an optimal barycenter, the following hold true:

  (i)

    Let \({{\tilde{\nu }}}\) be computed with Algorithm 1, then

    $$\begin{aligned} \frac{\Psi _2({{\tilde{\nu }}})}{\Psi _2({{\hat{\nu }}})} = N = \frac{1}{\lambda _1}. \end{aligned}$$

    If the reference measure is chosen uniformly at random, then

    $$\begin{aligned} \frac{{\mathbb {E}}[\Psi _2({{\tilde{\nu }}})]}{\Psi _2({{\hat{\nu }}})} = 2 - \frac{1}{N} \overset{N\rightarrow \infty }{\longrightarrow }\ 2. \end{aligned}$$
  (ii)

    Let \({{\tilde{\nu }}}\) be computed with Algorithm 2, then

    $$\begin{aligned} \frac{\Psi _1({{\tilde{\nu }}})}{\Psi _1({{\hat{\nu }}})} = N-1 = \frac{1}{\lambda _1} - 1. \end{aligned}$$

    If the reference measure is chosen uniformly at random, then

    $$\begin{aligned} \frac{{\mathbb {E}}[\Psi _1({{\tilde{\nu }}})]}{\Psi _1({{\hat{\nu }}})} = 2\Big (1 - \frac{1}{N}\Big ) \overset{N\rightarrow \infty }{\longrightarrow }\ 2. \end{aligned}$$
  (iii)

    Let \({{\tilde{\nu }}}\) be computed with Algorithm 3, then

    $$\begin{aligned} \frac{\Psi _2({{\tilde{\nu }}})}{\Psi _2({{\hat{\nu }}})} \ge \frac{N-1}{N}\Big ( 1+ \frac{N-1}{N} \Big ) \overset{N\rightarrow \infty }{\longrightarrow }\ 2. \end{aligned}$$
  (iv)

    Let \({{\tilde{\nu }}}\) be computed with Algorithm 4, then

    $$\begin{aligned} \frac{\Psi _1({{\tilde{\nu }}})}{\Psi _1({{\hat{\nu }}})} = 2 - \frac{1}{N} \overset{N\rightarrow \infty }{\longrightarrow }\ 2. \end{aligned}$$

Proof

We consider

$$\begin{aligned} \mu ^1 {:}{=}\delta (0), \qquad \mu ^2= \ldots = \mu ^N {:}{=}\frac{1}{2} (\delta (-1)+\delta (1)). \end{aligned}$$
  (i)

    For \(\pi ^i\) defined as in Algorithm 1, it holds

    $$\begin{aligned} \pi ^i = \frac{1}{2} (\delta (0, -1) + \delta (0, 1)), \qquad i=2, \dots , N. \end{aligned}$$

    and thus

    $$\begin{aligned} {{\tilde{\nu }}} = \delta \Big (\frac{1}{2} (-1+1)\Big ) = \delta (0) = \mu ^1. \end{aligned}$$

    Thus,

    $$\begin{aligned} \Psi _2({{\tilde{\nu }}}) = \sum _{i=1}^N\lambda _i{\mathcal {W}}_2^2({{\tilde{\nu }}}, \mu ^i) = \frac{N-1}{N}. \end{aligned}$$

    On the other hand, consider

    $$\begin{aligned} \nu = \frac{1}{2} \Big (\delta \Big (-\frac{N-1}{N}\Big ) + \delta \Big (\frac{N-1}{N}\Big )\Big ), \end{aligned}$$

    then

    $$\begin{aligned} \Psi _2(\nu ) = \sum _{i=1}^N\lambda _i{\mathcal {W}}_2^2(\nu , \mu ^i) = \frac{1}{N} \Big ( \Big ( \frac{N-1}{N} \Big )^2 + (N-1)\Big ( \frac{1}{N} \Big )^2 \Big ) = \frac{N-1}{N^2}, \end{aligned}$$

    such that

    $$\begin{aligned} \frac{\Psi _2({{\tilde{\nu }}})}{\Psi _2({{\hat{\nu }}})} \ge \frac{\Psi _2({{\tilde{\nu }}})}{\Psi _2(\nu )} = N = \frac{1}{\lambda _1}. \end{aligned}$$
  (ii)

    We only need to compute the following medians:

    $$\begin{aligned}&{{\,\textrm{argmin}\,}}_{m\in {\mathbb {R}}^d} \frac{1}{N} \Vert 0 - m \Vert + \frac{1}{2} \sum _{i=2}^N \frac{1}{N} (\Vert -1 - m\Vert + \Vert 1 - m\Vert ) = 0, \\&{{\,\textrm{argmin}\,}}_{m\in {\mathbb {R}}^d} \frac{1}{N} \Vert 0 - m \Vert + \sum _{i=2}^N \frac{1}{N} (\Vert -1 - m\Vert ) = -1, \quad \text {and}\\&{{\,\textrm{argmin}\,}}_{m\in {\mathbb {R}}^d} \frac{1}{N} \Vert 0 - m \Vert + \sum _{i=2}^N \frac{1}{N} (\Vert 1 - m\Vert ) = 1. \end{aligned}$$

    Then we see that \({{\tilde{\nu }}} = \mu ^1\), such that

    $$\begin{aligned} \Psi _1({{\tilde{\nu }}}) = \sum _{i=1}^N\lambda _i{\mathcal {W}}_1(\mu ^1, \mu ^i) = \frac{1}{N} \cdot 0 + \Big (1-\frac{1}{N}\Big ) \cdot 1 = 1-\frac{1}{N}, \end{aligned}$$

    and for any \(j\in \{ 2, \dots , N \}\),

    $$\begin{aligned} \Psi _1(\mu ^j) = \frac{1}{N}\cdot 1 + \Big (1- \frac{1}{N}\Big )\cdot 0 = \frac{1}{N}, \end{aligned}$$

    which leads to

    $$\begin{aligned} \frac{\Psi _1({{\tilde{\nu }}})}{\Psi _1({{\hat{\nu }}})} \ge \frac{\Psi _1({{\tilde{\nu }}})}{\Psi _1(\mu ^j)} = N-1 = \frac{1}{\lambda _1}-1. \end{aligned}$$

    For the randomized case, we get

    $$\begin{aligned} {\mathbb {E}} [\Psi _1({{\tilde{\nu }}})]&= \frac{1}{N} \Psi _1(\mu ^1) + \Big (1- \frac{1}{N}\Big ) \Psi _1(\mu ^j) = \frac{1}{N}\cdot \Big (1- \frac{1}{N}\Big ) + \Big (1- \frac{1}{N}\Big )\cdot \frac{1}{N} \\&= 2\frac{1}{N}\Big (1- \frac{1}{N}\Big ), \end{aligned}$$

    such that

    $$\begin{aligned} \frac{{\mathbb {E}} [\Psi _1({{\tilde{\nu }}})]}{\Psi _1({{\hat{\nu }}})} \ge \frac{2\frac{1}{N} (1- \frac{1}{N})}{\frac{1}{N}} = 2\Big (1- \frac{1}{N}\Big ). \end{aligned}$$
  (iii)

    We get for \(i=2, \dots , N\) that

    $$\begin{aligned} \pi ^{ij} = {\left\{ \begin{array}{ll} \frac{1}{2} (\delta (-1,0)+\delta (1, 0)), &{} j=1, \\ \frac{1}{2} (\delta (-1,-1)+\delta (1, 1)), &{} j=2, \dots , N, \end{array}\right. } \end{aligned}$$

    and hence

    $$\begin{aligned} {{\tilde{\nu }}}^i&= \frac{1}{2} \Big ( \delta \Big ( \frac{N-1}{N}\cdot (-1) + \frac{1}{N} \cdot 0 \Big ) + \delta \Big ( \frac{N-1}{N}\cdot 1 + \frac{1}{N} \cdot 0 \Big ) \Big ) \\&= \frac{1}{2} \Big ( \delta \Big ( -\frac{N-1}{N} \Big ) + \delta \Big ( \frac{N-1}{N} \Big ) \Big ). \end{aligned}$$

    Thus,

    $$\begin{aligned} {{\tilde{\nu }}} = \frac{1}{N} \delta (0) + \frac{N-1}{2N} \Big ( \delta \Big ( -\frac{N-1}{N} \Big ) + \delta \Big ( \frac{N-1}{N} \Big ) \Big ). \end{aligned}$$

    Hence, it is easy to compute that

    $$\begin{aligned} {\mathcal {W}}_2^2({{\tilde{\nu }}}, \mu ^i) = {\left\{ \begin{array}{ll} \frac{N-1}{N} ( \frac{N-1}{N} )^2 = ( \frac{N-1}{N} )^3, &{}i=1 \\ \frac{1}{N} ( \frac{N-1}{N} )^2 + \frac{N-1}{N} ( \frac{1}{N} )^2 = \frac{1}{N^3} N(N-1) = \frac{N-1}{N^2}, &{}i=2, \dots , N, \end{array}\right. } \end{aligned}$$

    such that

    $$\begin{aligned} \Psi _2({{\tilde{\nu }}}) = \frac{1}{N} \Big (\Big ( \frac{N-1}{N}\Big )^3 + (N-1)\Big (\frac{N-1}{N^2}\Big ) \Big ) = \frac{N-1}{N^2} \Big ( \Big (\frac{N-1}{N}\Big )^2 + \frac{N-1}{N} \Big ). \end{aligned}$$

    Finally, considering

    $$\begin{aligned} \nu = \frac{1}{2} \Big (\delta \Big (-\frac{N-1}{N}\Big ) + \delta \Big (\frac{N-1}{N}\Big )\Big ), \end{aligned}$$

    we get

    $$\begin{aligned} \frac{\Psi _2({{\tilde{\nu }}})}{\Psi _2({{\hat{\nu }}})} \ge \frac{\Psi _2({{\tilde{\nu }}})}{\Psi _2(\nu )} = \Big (\frac{N-1}{N}\Big )^2 + \frac{N-1}{N} = \frac{N-1}{N}\Big ( 1 + \frac{N-1}{N}\Big ) \overset{N\rightarrow \infty }{\longrightarrow }\ 2. \end{aligned}$$
  (iv)

    In this case, we get

    $$\begin{aligned} {{\tilde{\nu }}} = \frac{1}{N} \delta (0) + \frac{N-1}{2N}\Big ( \delta (-1) + \delta (1) \Big ). \end{aligned}$$

    Compute

    $$\begin{aligned} \Psi _1({{\tilde{\nu }}}) = \sum _{i=1}^N\lambda _i{\mathcal {W}}_1({{\tilde{\nu }}}, \mu ^i) = \frac{1}{N} \cdot \frac{N-1}{N} + \frac{N-1}{N} \cdot \frac{1}{N} = \frac{2(N-1)}{N^2}. \end{aligned}$$

    On the other hand, for any \(j\in \{ 2, \dots , N \}\),

    $$\begin{aligned} \Psi _1(\mu ^j) = \frac{1}{N} {\mathcal {W}}_1(\mu ^j, \mu ^1) = \frac{1}{N}, \end{aligned}$$

    such that

    $$\begin{aligned} \frac{\Psi _1({{\tilde{\nu }}})}{\Psi _1({{\hat{\nu }}})} \ge \frac{\Psi _1({{\tilde{\nu }}})}{\Psi _1(\mu ^j)} = 2\frac{\frac{N-1}{N^2}}{\frac{1}{N}} = 2\Big ( 1- \frac{1}{N} \Big ) \overset{N\rightarrow \infty }{\longrightarrow }\ 2. \end{aligned}$$

\(\square \)

Remark 4.8

Intuitively, the example used in the proof of Proposition 4.7 is based on the fact that the analyzed algorithms cannot split \(\mu ^1=\delta (0)\) into two Dirac measures with weight 1/2, in which case the approximations would be optimal. We chose the example in the proof for simplicity of exposition. However, it is also possible to show the same sharpness results using measures \(\mu ^1, \dots , \mu ^N\) that all have two support points. To this end, for N odd and some small \(\varepsilon >0\), consider

$$\begin{aligned} \mu ^1&{:}{=}\frac{1}{2} (\delta (0, -\varepsilon ) + \delta (0, \varepsilon )), \\ \mu ^2 = \mu ^4 = \dots = \mu ^{N-1}&{:}{=}\frac{1}{2} (\delta (-1, -\varepsilon ) + \delta (1, \varepsilon )) \\ \mu ^3 = \mu ^5 = \dots = \mu ^N&{:}{=}\frac{1}{2} (\delta (-1, \varepsilon ) + \delta (1, -\varepsilon )).\end{aligned}$$

5 Numerical results

We present a numerical comparison of different Wasserstein-2 barycenter algorithms, the computation of a Wasserstein-1 barycenter, and, as applications, an interpolation between measures and textures, respectively. To compute the exact two-marginal transport plans of the presented algorithms, we used the emd function of the Python-OT (POT 0.7.0) package [55], which is a wrapper of the network simplex solver from [8], which, in turn, is based on an implementation in the LEMON C++ library.
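For illustration, one such plan and the corresponding cost can be computed as follows; the two toy measures are made up for this example.

```python
import numpy as np
import ot

# two toy discrete measures on R^2
X1, a = np.array([[0.0, 0.0], [1.0, 0.0]]), np.array([0.5, 0.5])
X2, b = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]), np.array([0.2, 0.3, 0.5])

C = ot.dist(X1, X2)     # squared Euclidean cost matrix (default metric)
P = ot.emd(a, b, C)     # exact optimal transport plan via the network simplex
cost = np.sum(P * C)    # equals W_2^2(mu^1, mu^2); ot.emd2(a, b, C) returns this value directly
```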

5.1 Numerical comparison

In this section, we compare different Wasserstein-2 barycenter algorithms in terms of accuracy and runtime. We would like to include popular algorithms such as iterative Bregman projections in the comparison. However, many of these algorithms operate in a fixed-support setting, that is, they only optimize over the weights of some a priori chosen support grid. On the other hand, free-support methods are the ideal candidates for sparse and possibly high-dimensional point cloud data, i.e., if such a grid structure is not present. An approximation of such data with a coarse grid decreases the accuracy of the solution, but a fine grid increases the runtime of the fixed-support methods. Hence, the fair choice of a comparison data set is challenging.

We attempt to solve this problem by choosing a grid data set with relatively few nonzero mass weights, which has nevertheless been commonly used as a benchmark example in the literature, also for fixed-support algorithms. It originates from [15] and consists of \(N=10\) ellipses shown in Fig. 2, given as images of \(60\times 60\) pixels. We take \(\lambda \equiv 1/N\).

First, we compute approximate barycenters \({{\tilde{\nu }}}\) using the presented algorithms in the case \(p=2\), which we call “Reference” and “Pairwise” below. Furthermore, we compute the barycenter using publicly available implementations for the methods [18, 21, 36], called “Debiased”, “IBP”, “Product”, “MAAIPM” and “Frank–Wolfe” below, the exact barycenter method from [30], called “Exact” below, and the method from [23], called “FastIBP” below. We also tried the BADMM method from [56], but since it did not converge properly, we do not consider it further.

Fig. 2

Data set of 10 nested ellipses

While the fixed-support methods receive the input measures supported on \(\{ 0, 1/60, \dots , 59/60 \}\times \{ 0, 1/60, \dots , 59/60 \}\) as gray-valued \(60\times 60\) images, the free-support methods get the measures as a list of support positions and corresponding weights. Clearly, the sparse support of the data is an advantage for the free-support methods. As a means to facilitate the comparison, we also execute the reference and pairwise algorithms as fixed-support versions. Instead of computing optimal solutions in Algorithms 1 and 3, we approximate the optimal transport plans \(\pi ^{ij}\) using the Sinkhorn algorithm on the full grid. We call these algorithms “Reference full” and “Pairwise full” below. Note that, as the implementations of “IBP”, “Debiased” and “Product” do, we exploit the fact that the Sinkhorn kernel \(K=\exp (-c/\varepsilon )\) is separable, such that the corresponding convolution can be performed separately in the x- and y-direction, see, e.g., [13, Rem. 4.17]. This also reduces memory consumption, since it is not necessary to compute a distance matrix in \({\mathbb {R}}^{3600\times 3600}\). We remark that the runtime of the Sinkhorn algorithms crucially depends on the desired accuracy. In analogy to “IBP”, “Debiased” and “Product”, which terminate once the barycenter measure has a maximum change of \(10^{-5}\) in an iteration, we terminate once this tolerance is reached in the first marginal of \(\pi ^{ij}\). We check this criterion only every 10th iteration, since evaluating it produces computational overhead (in contrast to the aforementioned methods).
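A minimal sketch of this separable kernel application on an \(n\times n\) grid reads as follows (function names are illustrative):

```python
import numpy as np

def gibbs_kernel_1d(n, eps):
    """1D Gibbs kernel exp(-|s-t|^2 / eps) on the grid {0, 1/n, ..., (n-1)/n}."""
    t = np.arange(n) / n
    return np.exp(-(t[:, None] - t[None, :]) ** 2 / eps)

def apply_kernel(A, K1):
    """Apply the 2D kernel K = exp(-c_2 / eps) to an image-shaped density A of shape (n, n).
    Since c_2 separates into an x- and a y-part, this amounts to two 1D multiplications and
    avoids forming the full (n^2, n^2) kernel matrix."""
    return K1 @ A @ K1.T
```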

For all Sinkhorn methods, we used a parameter of \(\varepsilon =0.002\) and otherwise chose the default parameters. For the reference algorithm, we have chosen the reference measure to be the upper left measure shown in Fig. 2. To compare the runtimes, we executed all codes on the same laptop with Intel i7-8550U CPU and 8GB memory. The Matlab codes were run in Matlab R2020a. The runtimes of the Python codes are averages over several runs, as obtained by Python’s timeit function. The results are shown in Fig. 3 and Table 1.

Fig. 3

Barycenters for data set in Fig. 2 computed by different methods. The weight of a support point is indicated by its area in the plot

Table 1 Numerical results for the ellipse barycenter problem. The runtime is measured in seconds. The ranking is the sum of the standard scores of the logarithm of the relative error and the runtime, respectively. The best values of all approximative algorithms are highlighted in bold

While the exact method has a very high runtime, no approximative method achieves a perfect relative error of \(\Psi _2({{\tilde{\nu }}})/\Psi _2({{\hat{\nu }}}) = 1\). However, the error is well below 2 for all methods, which is a lot better than the worst-case bounds shown above. In fact, using the problem-adapted bounds as outlined in Remark 4.6, without knowledge of \({{\hat{\nu }}}\), the pairwise algorithm already guarantees a relative error of at most \(1.64\%\). Whereas the pairwise algorithm achieves the lowest error of all approximative algorithms with around \(0.12\%\), the reference algorithm achieves the lowest runtime of 0.05 seconds. Notably, the FastIBP method is a lot slower than IBP whilst producing a more blurry result, which might indicate an implementation issue. While the Frank–Wolfe method suffers from outliers, the support of most fixed-support methods is more extended than the exact barycenter’s support, since Sinkhorn barycenters have dense support.

We attempt to measure the best compromise between low error and low runtime by means of the sum of the standard scores of the logarithmic relative errors and runtimes, respectively, where the standard score or z-score is the value normalized by the population mean and standard deviation. Table 1 is sorted according to this ranking score. The reference and pairwise algorithms are the best with respect to this metric. As expected, the full-support versions of the reference and pairwise algorithms have worse runtime and accuracy, which can likely be explained by the errors of the Sinkhorn algorithm. Nevertheless, they offer a competitive tradeoff between speed and accuracy with respect to the other methods, which shows that the advantage of the framework considered in this paper is not only due to the sparse support of the chosen data set. Altogether, the results of the proposed algorithms look promising.

Fig. 4

Barycenters computed with Algorithms 2 and 4 for the data set in Fig. 2 and cost functions \(c_1(x, y)=\Vert x - y\Vert \) and \(c_2(x, y)=\Vert x - y\Vert ^2\). The weight of a support point is indicated by its area in the plot

5.2 Wasserstein-1 barycenters

Next, we compute approximate Wasserstein-1 barycenters of the same data set as in the previous Sect. 5.1 using the Algorithms 2 and 4. The results are depicted in Fig. 4 in the top row.

Note that the elliptic structure of the barycenter is only retained to some degree, which can probably be explained by the choice of \(c_1\) as the cost function. For example, it is easy to show that the OT plans corresponding to \({\mathcal {W}}_2^2\) are translation equivariant. On the other hand, this property fails for any other \(p\in [1, 2) \cup (2, \infty ]\), as it is easy to derive from the example with \(\mu , \nu \in {\mathcal {P}}({\mathbb {R}}^2)\) defined by

$$\begin{aligned} \mu {:}{=}\frac{1}{2} (\delta (0, 0) + \delta (1, 0)), \qquad \nu {:}{=}\frac{1}{2} (\delta (0, 0) + \delta (0, 1)).\end{aligned}$$

Thus, we also execute Algorithms 2 and 4, where we swap \(c_1\) for the squared Euclidean costs \(c_2\) in order to compute the OT plans \(\pi ^{ij}\in \Pi (\mu ^i, \mu ^j)\), but continue to compute the barycenter support using Weiszfeld’s algorithm. The results are shown in the bottom row of Fig. 4.

Now the elliptic structure is preserved a lot better and the results are very similar to the Wasserstein-2 barycenters. We conclude that the choice of cost function had a larger impact on the results than whether the barycenter support is constructed using the means or geometric medians. Algorithms 2 and 4 with \(c_2\) thus seem like an interesting alternative to Algorithms 1 and 3 in the case where one expects outlier measures, since the median is more robust to outliers than the mean, see, e.g. [57].

5.3 Multiple different sets of weights

For this numerical application, we compute barycenters between four given measures for multiple sets of weights \(\lambda ^k=(\lambda _1^k, \dots , \lambda _N^k)\), \(\lambda ^k\in \Delta _4\), \(k=1, \dots , K\), obtaining an interpolation between those measures. An advantage of the presented algorithms for that application is that the optimal transport plans between the input measures, which are the bottleneck computations, only need to be computed once, whereas the matrix multiplications for interpolations with new weights are fast. We use the proposed algorithms for a data set of four measures given as images of size \(50\times 50\), for sets of weights that bilinearly interpolate between the four unit vectors. The original measures are shown in the four corners of Fig. 6. For the reference algorithm, we use the upper left measure as the reference measure. The results are shown in Figs. 5 and 6.

Fig. 5

Approximate barycenters for different sets of weights computed by Algorithm 1

Fig. 6

Approximate barycenters for different sets of weights computed by Algorithm 3

While the running time of the reference algorithm is shorter, its solution has several artifacts, in particular when the weight \(\lambda ^k_1\) of the reference measure is low. On the other hand, by effectively averaging the reference algorithm over different choices of the reference measure, the pairwise algorithm is able to smooth out some of these artifacts. We compare the results of both algorithms for \(\lambda =(0.04, 0.16, 0.16, 0.64)\) in Fig. 7. We also computed the upper error bound \(\eta \) of the pairwise algorithm given by (4.6), exemplarily for uniform weights, which corresponds to a relative error of at most \(3.6\%\).

Fig. 7

Close comparison of two approximate barycenters of the reference and pairwise algorithms for the weights \(\lambda =(0.04, 0.16, 0.16, 0.64)\)

5.4 Texture interpolation

For another application, we lift the experiment of Sect. 5.3 from interpolation of measures in Euclidean space to interpolation of textures via the synthesis method from [6], using their publicly available source code. While the authors already interpolated between two different textures in that paper, requiring only the solution of a two-marginal optimal transport problem to obtain a barycenter, we can do this for multiple textures using approximate barycenters for multiple measures. Briefly, the authors proposed to encode a texture as a collection of smaller patches \(F_j\), where each, say, \(4\times 4\)-patch is encoded as a point \(x_j\in {\mathbb {R}}^{16}\). The texture is then modeled as a “feature measure” \(\frac{1}{M} \sum _{j=1}^M\delta (x_j)\in {\mathcal {P}}({\mathbb {R}}^{16})\), such that this description is invariant under different positions of its patches within the image. Finally, this is repeated for image patches at several scales s, obtaining a collection of measures \((\mu ^s)\), \(s=1, \dots , S\). Synthesizing an image is done by optimizing an optimal transport loss between its feature measure and some reference measure (and then summing over s), as obtained, e.g., from a reference image. Thus, the synthesized image tries to imitate the reference image in terms of its feature measures. Here, we choose four texture images of size \(256\times 256\) from the “Describable Textures Dataset” [58]. We compute their feature measures \(\mu ^{1, s}, \dots , \mu ^{4, s}\) for each scale. Next, as in Sect. 5.3, we compute approximate barycenters \({{\tilde{\nu }}}^{k, s}\) for all k and s using the reference algorithm, where k runs over different sets of weights, and perform the image synthesis for each k using the \({{\tilde{\nu }}}^{k, s}\) as feature measures to imitate. The results are shown in Fig. 8. Using this approach, one obtains a visually pleasing interpolation between the four given textures.

Fig. 8

Interpolation of four different textures that are displayed in the four corners. The weight set for the barycenter computations performed for each image is shown above each synthesized image

6 Conclusion

In this paper, we derived two straightforward algorithms from a well-known framework for Wasserstein-p barycenters for \(p=1, 2\). We analyzed them theoretically and practically, showing that they are easy to implement, produce sparse solutions and are thus memory-efficient. We validated their speed and precision using numerical examples.

In the future, it would be interesting to generalize the discussed algorithms and bounds to other \(p\ge 1\). For instance, for \(p=\infty \), the barycentric map \(M_{\lambda , \pi }^p\) corresponds to the solution of the so-called smallest enclosing sphere problem, which can be solved by Welzl’s algorithm [59]. Finding a lower bound as in Proposition 4.2 for general \(p\ge 1\) is not straightforward, since the proofs of (4.2) and (4.3) are specific to \(p=2\) and \(p=1\).