1 Introduction

Over the last decades, optimal transport (OT) has attracted increasing attention in various applications, e.g., barycenter problems [1, 4], image matching [48, 49], and machine learning [3, 29, 37]. As the OT minimization problem is numerically difficult to solve, a common approach is to add an entropy regularization term [18]. This enables us to approximately solve the problem using Sinkhorn iterations [46] by exploiting an explicit relation between the solutions to the corresponding primal and dual problems. These iterations can be implemented in parallel on GPUs, which makes even large-scale problems solvable within reasonable time. Recently, a debiased version of the Sinkhorn divergence was investigated in [25, 41, 44], which has the advantage that it characterizes the weak convergence of measures. However, in many applications, the assumption that the marginal measures are matched exactly appears to be unreasonable. To deal with this issue, unbalanced optimal transport (UOT) [12, 13, 39] was introduced. Here, the hard marginal constraints are replaced by penalizing the \(\varphi \)-divergences between the given measures and the corresponding marginals. By making minimal modifications, we can use Sinkhorn-like iterations and hence the computational complexity and scalability remain the same. For the unbalanced case, the mentioned debiasing was discussed in [47]. For Gaussian distributions, the corresponding divergence even has a closed form [34].
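The Sinkhorn scaling mentioned above can be illustrated for two discrete measures by alternately matching the row and column marginals of a scaled Gibbs kernel. The following is a minimal sketch of generic entropic OT, not the algorithm developed later in this paper; the function name and discretization are our own:

```python
import math

def sinkhorn(mu, nu, C, eps, iters=200):
    """Balanced entropic OT between discrete measures mu, nu (weight lists).
    C[i][j] is the cost between the i-th and j-th support points; returns the
    transport plan diag(u) K diag(v) as a nested list."""
    K = [[math.exp(-c_ij / eps) for c_ij in row] for row in C]  # Gibbs kernel
    u = [1.0] * len(mu)
    v = [1.0] * len(nu)
    for _ in range(iters):
        # alternate scalings so that the plan matches both prescribed marginals
        u = [mu[i] / sum(K[i][j] * v[j] for j in range(len(nu))) for i in range(len(mu))]
        v = [nu[j] / sum(K[i][j] * u[i] for i in range(len(mu))) for j in range(len(nu))]
    return [[u[i] * K[i][j] * v[j] for j in range(len(nu))] for i in range(len(mu))]
```

Each update is a cheap matrix-vector product, which is why these iterations parallelize well on GPUs.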

So far, we have commented on OT between two measures. For certain practical tasks such as matching for teams [10], particle tracking [11], and information fusion [19, 24], it is useful to compute transport plans between more than two marginal measures. This is done in the framework of multi-marginal optimal transport (MOT) [42], where again entropy regularization is possible. The problem was tackled numerically for Coulomb cost in [5] and more general repulsive costs in [17, 31]. Later, an efficient solution for tree-structured costs using Sinkhorn iterations was established in [32]. More recently, these results were extended to even more general cost functions in [2].

In this paper, we want to combine UOT and MOT by investigating unbalanced multi-marginal optimal transport (UMOT). We can build upon the previous papers, but we will see that this has to be done carefully, since various generalizations that seem straightforward at first glance turn out not to be. Inspired by the papers [20] and [47], we formulate Sinkhorn iterations for the UMOT problem. Our approach differs from that in [47] as we cannot rely on the 1-Lipschitz continuity of the \((c,\varepsilon )\)-transform, which only holds in the two-marginal case. Instead, we establish uniform boundedness of the iterates and then exploit the compactness of the Sinkhorn operator. As in the two-marginal case, we retain the excellent scalability of the algorithm for tree-structured costs. Furthermore, we prove that these iterations are convergent under mild assumptions.

As one possible application, we discuss the computation of regularized UOT barycenters based on the regularized UMOT problem. The OT barycenter problem was first introduced in [1] and then further studied, e.g., in [4, 16, 21, 40] for the balanced setting. As soon as entropic regularization is applied, we usually obtain a blurred barycenter. One possible solution is to use the debiased Sinkhorn divergence instead [33, 43]. As in [32] for the balanced case, we observe that solving a single UMOT problem instead of minimizing a sum of UOT “distances” reduces the blur considerably. We also validate this observation theoretically. For this purpose, we show that for tree-structured costs UMOT transport plans are already determined by their two-dimensional marginals. A complementary approach for computing barycenters in an unbalanced setting is based on the Hellinger–Kantorovich distance [14, 28].

Furthermore, we provide a numerical UMOT example with a path-structured cost function. Compared with solving a sequence of UOT problems with the same input measures, the UMOT approach has several advantages: for the sequential approach, the target marginal of one optimal UOT plan is not necessarily equal to the source marginal of the subsequent one, whereas for the (single) UMOT plan such a mismatch is impossible by construction. Additionally, coupling the problems allows information to be shared between them. Note that there is almost no computational overhead compared to solving the problems sequentially.

Outline of the Paper Section 2 contains the necessary preliminaries. The regularized UMOT problem, in particular the existence and uniqueness of a solution as well as its dual problem, is provided in Sect. 3. In Sect. 4, we derive a Sinkhorn algorithm for solving the regularized UMOT problem and prove its convergence. Then, in Sect. 5, we investigate the barycenter problem with respect to regularized UOT and establish a relation to the regularized UMOT problem with a tree-structured cost function, where the tree is simply star-shaped. Furthermore, we discuss an extension of these considerations to more general tree-structured cost functions. Additionally, we outline an efficient implementation of the Sinkhorn algorithm for tree-structured costs. Numerical examples, which illustrate our theoretical findings, are provided in Sect. 6. Finally, we draw conclusions in Sect. 7.

2 Preliminaries

Throughout this paper, let \({\mathbb {X}}\) be a compact Polish space with associated Borel \(\sigma \)-algebra \({\mathcal {B}}({\mathbb {X}})\). By \({\mathcal {M}}({\mathbb {X}})\), we denote the space of finite signed real-valued Borel measures, which can be identified via Riesz’ representation theorem with the dual space of the continuous functions \(C({\mathbb {X}})\) endowed with the norm \(\Vert f\Vert _{C({\mathbb {X}})}\, {:}{=}\max _{x \in {\mathbb {X}}} |f(x)|\). Denoting the associated dual pairing by \(\langle \mu , f \rangle ~{:}{=}\int _{\mathbb {X}}f \, d\mu \), the space \({\mathcal {M}}({\mathbb {X}})\) can be equipped with the weak*-topology, i.e., a sequence \((\mu _n)_{n \in {\mathbb {N}}} \subset {\mathcal {M}}({\mathbb {X}})\) converges weakly to \(\mu \in {\mathcal {M}}({\mathbb {X}})\), written \(\mu _n \rightharpoonup \mu \), if

$$\begin{aligned} \int _{\mathbb {X}}f \, d\mu _n \rightarrow \int _{\mathbb {X}}f \, d\mu \quad \text {for all } f \in C({\mathbb {X}}). \end{aligned}$$

The associated dual norm of \(\mu \in {\mathcal {M}}({\mathbb {X}})\), also known as total variation, is given by \(\mathrm {TV} (\mu ) = \sup _{\Vert f\Vert _{C({\mathbb {X}})} \le 1} \langle \mu , f \rangle \). By \({\mathcal {M}}^+({\mathbb {X}})\), we denote the subset of nonnegative measures. The support of \(\mu \in {\mathcal {M}}^+({\mathbb {X}})\) is defined as the closed set

$$\begin{aligned} \text {supp}(\mu )~{:}{=}~\bigl \{ x \in {\mathbb {X}}:&\,B \subset {\mathbb {X}}\text { open, }x \in B \\&\implies \mu (B) >0\bigr \}. \end{aligned}$$

For \(\mu \in {\mathcal {M}}^+({\mathbb {X}})\) and \(p \in [1,\infty ]\), let \(L^p({\mathbb {X}},\mu )\) be the Banach space (of equivalence classes) of real-valued measurable functions f with finite norm \(\Vert f\Vert _{L^p({\mathbb {X}},\mu )}\).

A measure \(\nu \in {\mathcal {M}}({\mathbb {X}})\) is called absolutely continuous with respect to \(\mu \), and we write \(\nu \ll \mu \), if for every \(A \in {\mathcal {B}}({\mathbb {X}})\) with \(\mu (A) = 0\) we have \(\nu (A) = 0\). For any \(\mu , \nu \in {\mathcal {M}}^+({\mathbb {X}})\) with \(\nu \ll \mu \), the Radon–Nikodym derivative

$$\begin{aligned} \sigma _\nu \,~{:}{=}~\frac{\,\mathrm {d}\nu }{\,\mathrm {d}\mu } \in L^1({\mathbb {X}},\mu ) \end{aligned}$$

exists and \(\nu = \sigma _\nu \mu \). Furthermore, \(\mu , \nu \in {\mathcal {M}}({\mathbb {X}})\) are called mutually singular, denoted by \(\mu \perp \nu \), if there exist two disjoint sets \(X_\mu , X_\nu \in {\mathcal {B}}({\mathbb {X}})\) such that \({\mathbb {X}}= X_\mu {\dot{\cup }} X_\nu \) and for every \(A \in {\mathcal {B}}({\mathbb {X}})\) we have \(\mu (A) = \mu (A \cap X_\mu )\) and \(\nu (A) = \nu (A \cap X_\nu )\). Given \(\mu ,\nu \in {\mathcal {M}}^+({\mathbb {X}})\), there exists a unique Lebesgue decomposition \(\mu = \sigma _\mu \nu + \mu ^\perp \), where \(\sigma _\mu \in L^1({\mathbb {X}},\nu )\) and \(\mu ^\perp \perp \nu \). Let \({\mathbb {Y}}\) be another compact Polish space and \(T:{\mathbb {X}}\rightarrow {\mathbb {Y}}\) be a measurable function, i.e., \(T^{-1}(A) \in {\mathcal {B}}({\mathbb {X}})\) for all \(A \in {\mathcal {B}}({\mathbb {Y}})\). Then, the push-forward measure of \(\mu \) by T is defined as \(T_\# \mu ~{:}{=}~\mu \circ T^{-1}\).

Let V be a real Banach space with dual \(V^*\) and dual pairing \(\langle v,x \rangle = v(x)\), \(v \in V^*, x \in V\). For \(F:V \rightarrow (-\infty ,+\infty ]\), the domain of F is given by \( \mathrm {dom} F \,{:}{=}\, \{x \in V: F(x) \in {\mathbb {R}} \}\). If \(\mathrm {dom} F \not = \emptyset \), then F is called proper. By \(\Gamma _0(V)\) we denote the set of proper, convex, lower semi-continuous (lsc) functions mapping from V to \((-\infty ,+\infty ]\). The subdifferential of \(F:V \rightarrow (-\infty ,+\infty ]\) at a point \(x_0 \in \mathrm {dom} F\) is defined as

$$\begin{aligned}&\partial F(x_0)\,{:}{=}\, \bigl \{q \in V^*: F(x) \ge F(x_0) + \langle q,x - x_0 \rangle \text { for all } x \in V \bigr \}, \end{aligned}$$

and \(\partial F(x_0) = \emptyset \) if \(x_0 \not \in \mathrm {dom} F\). For a function \(F:V^n \rightarrow (-\infty ,\infty ]\), \(\partial _i F\) denotes the subdifferential of F with respect to the ith component. The Fenchel conjugate \(F^*:V^* \rightarrow (-\infty ,+\infty ]\) is given by

$$\begin{aligned} F^*(q) = \sup _{x \in V} \{ \langle q,x\rangle - F(x) \}. \end{aligned}$$

A nonnegative function \(\varphi \in \Gamma _0({\mathbb {R}})\) satisfying \(\varphi (1) = 0\) and \(\varphi |_{(-\infty ,0)} = \infty \) is called an entropy function with recession constant \(\varphi ^\prime _\infty = \lim _{x \rightarrow \infty } \varphi (x)/x\). In this case, \({{\,\mathrm{dom}\,}}(\varphi ^*) = (-\infty , \varphi ^\prime _\infty ]\). For every \(\mu ,\nu \in {\mathcal {M}}^+({\mathbb {X}})\) with Lebesgue decomposition \(\mu = \sigma _\mu \nu + \mu ^\perp \), the \(\varphi \)-divergence \(D_\varphi :{\mathcal {M}}^+({\mathbb {X}}) \times {\mathcal {M}}^+({\mathbb {X}}) \rightarrow [0,\infty ]\) is given in its primal and dual form by

$$\begin{aligned} D_\varphi (\mu ,\nu ) =&\int _{{\mathbb {X}}} \varphi \circ \sigma _\mu \,\mathrm {d}\nu + \varphi ^\prime _\infty \, \mu ^\perp ({\mathbb {X}})\nonumber \\ =&\sup _{f \in C({\mathbb {X}})} \langle \mu , f \rangle - \langle \nu , \varphi ^* \circ f \rangle \end{aligned}$$
(1)

with the convention \(0 \cdot \infty = 0\). The mapping \(D_\varphi \) is jointly convex, weakly lsc and fulfills \(D_\varphi (\mu ,\nu ) \ge 0\), see [39, Cor. 2.9]. Furthermore, we have for \(t > 0\) that \(D_{t\varphi } = t D_{\varphi }\). We will use the following \(\varphi \)-divergences, see also [47].

Example 2.1

  1. (i)

    Let \(\varphi ~{:}{=}~\iota _{\{1\}}\), where \(\iota _{{\mathcal {S}}}\) denotes the indicator function of the set \({{\mathcal {S}}}\), i.e., \(\iota _{{\mathcal {S}}}(x) = 0\) if \(x \in {{\mathcal {S}}}\), and \(\iota _{{\mathcal {S}}}(x) = +\infty \) otherwise. Then, \(\varphi ^*(q) = q\), \(\varphi '_\infty = \infty \), and

    $$\begin{aligned} D_{\varphi }(\mu ,\nu ) = \left\{ \begin{array}{ll} 0 &{}\mathrm {if} \; \mu =\nu ,\\ +\infty &{}\mathrm {otherwise}. \end{array} \right. \end{aligned}$$
  2. (ii)

    For \(\varphi ~{:}{=}~\iota _{[0,\infty )}\), we get \(\varphi ^* = \iota _{(-\infty ,0]}\), \(\varphi '_\infty = 0\), and \(D_{\varphi }(\mu ,\nu ) = 0\).

  3. (iii)

    Consider the Shannon entropy with \(\varphi (x)~{:}{=}~x\ln (x) - x +1\) and the agreement \(0 \ln 0 = 0\). Then, we have that \(\varphi ^*(q) = \exp (q) -1\), \(\varphi '_\infty = \infty \), and the \(\varphi \)-divergence is the Kullback–Leibler divergence \(\mathrm {KL}:{{\mathcal {M}}^+}({\mathbb {X}}) \times {{\mathcal {M}}^+}({\mathbb {X}}) \rightarrow [0, +\infty ]\). For \(\mu ,\nu \in {{\mathcal {M}}^+}({\mathbb {X}})\), if the Radon–Nikodym derivative \(\sigma _\mu = \frac{\,\mathrm {d}\mu }{\,\mathrm {d}\nu }\) exists, then

    $$\begin{aligned} \mathrm {KL} (\mu ,\nu )~{:}{=}~\int _{{\mathbb {X}}} \ln (\sigma _\mu ) \, \,\mathrm {d}\mu + \nu ({\mathbb {X}}) - \mu ({\mathbb {X}}), \end{aligned}$$

    and otherwise, we set \(\mathrm {KL} (\mu ,\nu )~{:}{=}~+ \infty \). Note that the \(\mathrm {KL}\) divergence is strictly convex with respect to the first variable.

  4. (iv)

    For \(\varphi (x)~{:}{=}~|x-1|\), we get \(\varphi ^*(q) = \max (-1,q)\) if \( q \le 1\) and \(\varphi ^*(q) = +\infty \) otherwise, \(\varphi '_\infty = 1\), and \( D_{\varphi }(\mu ,\nu ) = \mathrm {TV}(\mu - \nu ) \).
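For finite spaces, the divergences in (iii) and (iv) reduce to simple sums over the support points. The following sketch (our own naming; measures are represented as weight lists over a common finite support) evaluates both:

```python
import math

def kl_div(mu, nu):
    """Discrete Kullback-Leibler divergence of Example 2.1(iii):
    sum mu_k ln(mu_k / nu_k) + nu(X) - mu(X), with the convention 0 ln 0 = 0.
    Returns +inf if mu charges a point that nu does not (no density exists)."""
    s = 0.0
    for m, n in zip(mu, nu):
        if m > 0:
            if n == 0:
                return math.inf
            s += m * math.log(m / n)
    return s + sum(nu) - sum(mu)

def tv_div(mu, nu):
    """D_phi for phi(x) = |x - 1| of Example 2.1(iv): the total variation
    norm of mu - nu on a finite space."""
    return sum(abs(m - n) for m, n in zip(mu, nu))
```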

3 Unbalanced Multi-Marginal Optimal Transport

Throughout this paper, we use the following abbreviations. For compact Polish spaces \({\mathbb {X}}_i \ne \emptyset \), \(i=1,\ldots , N\), and measures \(\nu _i\in {\mathcal {M}}^+({\mathbb {X}}_i)\), \(i=1,\ldots , N\), we set \(\nu ~{:}{=}~(\nu _1,\ldots ,\nu _N)\) and

$$\begin{aligned} {\mathbb {X}}~{:}{=}~\bigtimes _{i=1}^N {\mathbb {X}}_i, \qquad \nu ^\otimes ~{:}{=}~\bigotimes _{i=1}^N \nu _i, \qquad {\mathbb {X}}_{\setminus j} ~{:}{=}~\bigtimes _{i \ne j} {\mathbb {X}}_i, \qquad \nu _{\setminus j}^\otimes ~{:}{=}~\bigotimes _{i \ne j} \nu _i. \end{aligned}$$
Furthermore, for \(p \in [1,\infty ]\) and \(f_i \in L^p({\mathbb {X}}_i, \nu _i)\), \(i=1,\ldots ,N\), we write \(f~{:}{=}~(f_1,\ldots ,f_N)\) and

$$\begin{aligned} f^\oplus (x_1,\ldots ,x_N) ~{:}{=}~\sum _{i=1}^N f_i(x_i), \qquad f_{\setminus j}^\oplus (x_1,\ldots ,x_N) ~{:}{=}~\sum _{i \ne j} f_i(x_i), \qquad L^{p,\times }({\mathbb {X}},\nu ) ~{:}{=}~\bigtimes _{i=1}^N L^p({\mathbb {X}}_i,\nu _i), \end{aligned}$$
where the product space \(L^{p,\times }({\mathbb {X}},\nu )\) is equipped with the \(L^p\) norm of the components. For example, if \({\mathbb {X}}={\mathbb {X}}_1\times {\mathbb {X}}_2\times {\mathbb {X}}_3\), then for \(f=(f_1, f_2, f_3)\in L^{p,\times }({\mathbb {X}},\nu )\) we have

$$\begin{aligned} f^\oplus _{\setminus 2}(x_1, x_2, x_3) = f_1(x_1) + f_3(x_3). \end{aligned}$$

If the domains and measures are clear from the context, we abbreviate the associated norms by \(\Vert \cdot \Vert _p\), \(p \in [1,\infty ]\).
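On finite spaces, the notation \(f^\oplus \) and \(f^\oplus _{\setminus j}\) amounts to sums of one-dimensional potentials, as in the example above. A minimal sketch (0-based indices; the function names are ours):

```python
def f_oplus(fs, idx):
    """f^oplus evaluated at the grid point idx = (i_1, ..., i_N):
    the sum of the one-dimensional potentials f_k[i_k]."""
    return sum(f[i] for f, i in zip(fs, idx))

def f_oplus_without(fs, idx, j):
    """f^oplus_{\\ j}: the same sum with the j-th component (0-based) omitted."""
    return sum(f[i] for k, (f, i) in enumerate(zip(fs, idx)) if k != j)
```

For instance, with three potentials, omitting the second component reproduces the displayed example \(f^\oplus _{\setminus 2}(x_1,x_2,x_3) = f_1(x_1) + f_3(x_3)\).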

Given \(\mu _i\in {\mathcal {M}}^+({\mathbb {X}}_i)\), \(i=1,\ldots , N\), the measures \(\gamma _i\in {\mathcal {M}}^+({\mathbb {X}}_i)\), \(i=1,\ldots , N\), are called reference measures for \(\mu \) if

$$\begin{aligned} {{\,\mathrm{KL}\,}}(\mu ^\otimes , \gamma ^\otimes )< \infty . \end{aligned}$$

Definition 3.1

Let \(\varepsilon >0\). Given a nonnegative cost \(c \in C ({\mathbb {X}})\), fully supported measures \(\mu _i \in {\mathcal {M}}^+({\mathbb {X}}_i)\), \(i=1,\ldots ,N\), with reference measures \(\gamma _i \in {\mathcal {M}}^+({\mathbb {X}}_i)\), and entropy functions \(\varphi _i \in \Gamma _0({\mathbb {R}})\), \(i=1,\ldots ,N\), the associated regularized unbalanced multi-marginal optimal transport problem \(({{\,\mathrm{UMOT}\,}})\) reads

$$\begin{aligned} {{\,\mathrm{UMOT}\,}}_\varepsilon (\mu )&~{:}{=}~ \inf _{\pi \in {\mathcal {M}}^+({\mathbb {X}})} \int _{{\mathbb {X}}}c\,\mathrm {d}\pi + \varepsilon {{\,\mathrm{KL}\,}}\bigl (\pi , \gamma ^\otimes \bigr ) \nonumber \\&\quad + \sum _{i=1}^N D_{\varphi _i}(P_{{\mathbb {X}}_{i\#}} \pi , \mu _i), \end{aligned}$$
(2)

where \(P_{{\mathbb {X}}_{i\#}} \pi \) is the ith marginal of \(\pi \).

Note that the full support condition is no real restriction as we can choose \({\mathbb {X}}_i = \text {supp}(\mu _i)\). Furthermore, we can implicitly incorporate weights for the marginal penalty terms in (2) by rescaling the entropy functions \(\varphi _i\).
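For finite spaces, the functional in (2) can be evaluated directly. The following sketch assumes \(N = 2\) and KL penalties for both marginals (other \(\varphi \)-divergences work alike); all names are illustrative:

```python
import math

def kl(p, q):
    """Discrete KL(p, q) with 0 ln 0 = 0; +inf if p charges a point q does not."""
    s = 0.0
    for a, b in zip(p, q):
        if a > 0:
            if b == 0:
                return math.inf
            s += a * math.log(a / b)
    return s + sum(q) - sum(p)

def umot_objective(pi, cost, gammas, mus, eps):
    """Value of the UMOT functional (2) for a two-marginal discrete plan pi:
    transport cost + eps * KL(pi, gamma1 x gamma2) + KL penalties on marginals."""
    n1, n2 = len(pi), len(pi[0])
    transport = sum(cost[i][j] * pi[i][j] for i in range(n1) for j in range(n2))
    flat_pi = [pi[i][j] for i in range(n1) for j in range(n2)]
    flat_gamma = [gammas[0][i] * gammas[1][j] for i in range(n1) for j in range(n2)]
    marg1 = [sum(pi[i]) for i in range(n1)]                       # first marginal
    marg2 = [sum(pi[i][j] for i in range(n1)) for j in range(n2)] # second marginal
    return (transport + eps * kl(flat_pi, flat_gamma)
            + kl(marg1, mus[0]) + kl(marg2, mus[1]))
```

Note that, unlike in the balanced case, the marginals of a feasible plan need not equal the \(\mu _i\); deviations are merely penalized.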

Remark 3.2

(Regularization and reference measures) Typical choices for the reference measures \(\gamma _i\) are

  1. (i)

    If \(\gamma _i = \mu _i\), then \({{\,\mathrm{KL}\,}}(\mu ^\otimes , \gamma ^\otimes ) = 0\) and we regularize in \({{\,\mathrm{UMOT}\,}}_\varepsilon \) by \({{\,\mathrm{KL}\,}}(\pi , \mu ^\otimes )\).

  2. (ii)

    \(\gamma _i\) the Lebesgue measure on \({\mathbb {X}}_i \subset {\mathbb {R}}^{d_i}\). Then \({{\,\mathrm{KL}\,}}(\mu ^\otimes , \gamma ^\otimes ) < \infty \) is equivalent to \(\mu _i\) having a density in a so-called Orlicz space, see [15, 41] for details. Furthermore, the regularizer in \({{\,\mathrm{UMOT}\,}}_\varepsilon \) is the entropy \({{\,\mathrm{KL}\,}}(\pi , \gamma ^\otimes ) = E (\pi )\) for continuous measures.

  3. (iii)

    \(\gamma _i\) the counting measure if the \({\mathbb {X}}_i\) are finite. Here, \({{\,\mathrm{KL}\,}}(\mu ^\otimes , \gamma ^\otimes ) < \infty \) is equivalent to every \(\mu _i\) being positive. Then, the regularizer is the entropy for discrete measures \({{\,\mathrm{KL}\,}}(\pi , \gamma ^\otimes )= E(\pi )\).

Definition (2) includes the following special cases:

  • If \(\varphi _i = \iota _{\{1\}}\) for all \(i=1,\ldots ,N\), then by Example 2.1(i) we obtain the regularized multi-marginal optimal transport (\(\mathrm {MOT}_{\varepsilon }\)) problem with hard constraints for the marginals. For \(\varepsilon =0\), we deal with the plain multi-marginal optimal transport (MOT) formulation.

  • If \(N=2\), then we are concerned with regularized unbalanced optimal transport (\(\mathrm {UOT}_{\varepsilon }\)). If \(\varphi _1 = \varphi _2 = \iota _{\{1\}}\), we get regularized optimal transport (\(\mathrm {OT}_{\varepsilon }\)), and if \(\varepsilon = 0\), we arrive at the usual optimal transport (OT) formulation.

Regarding existence and uniqueness of minimizers for \({{\,\mathrm{UMOT}\,}}_\varepsilon \), we have the following proposition.

Proposition 3.3

The \({{\,\mathrm{UMOT}\,}}_\varepsilon \) problem (2) admits a unique optimal plan.

Proof

The problem is feasible due to \({{\,\mathrm{KL}\,}}(\mu ^\otimes , \gamma ^\otimes ) < \infty \). Existence follows since \({{\,\mathrm{KL}\,}}(\cdot ,\gamma ^\otimes )\) and hence the whole functional (2) is coercive, and since all involved terms are lsc. For the uniqueness note that all terms in (2) are convex in \(\pi \) and that KL is moreover strictly convex in its first argument. \(\square \)

For applications, the dual formulation of \({{\,\mathrm{UMOT}\,}}_\varepsilon \) is important.

Proposition 3.4

The \({{\,\mathrm{UMOT}\,}}_\varepsilon \) problem (2) admits the dual representation

$$\begin{aligned} {{\,\mathrm{UMOT}\,}}_\varepsilon&(\mu ) = \!\!\sup _{f\in L^{\infty ,\times }({\mathbb {X}},\gamma )} \!\!\! {\mathcal {D}}_\varepsilon ^\varphi (f) + \varepsilon \gamma ^\otimes ({\mathbb {X}}), \end{aligned}$$
(3)

where

$$\begin{aligned} {\mathcal {D}}_\varepsilon ^\varphi (f)~{:}{=}~&- \sum _{i=1}^N \int _{{\mathbb {X}}_i} \varphi _i^*\circ (-f_i) \,\mathrm {d}\mu _i \nonumber \\&- \varepsilon \int _{{\mathbb {X}}} \exp \Bigl (\frac{ f^\oplus - c}{\varepsilon }\Bigr )\,\mathrm {d}\gamma ^\otimes . \end{aligned}$$
(4)

The optimal plan \({\hat{\pi }} \in {\mathcal {M}}^+({\mathbb {X}})\) for the primal problem (2) is related to any tuple of optimal dual potentials \({{\hat{f}}} \in L^{\infty ,\times }({\mathbb {X}},\gamma )\) by

$$\begin{aligned} {\hat{\pi }}= \exp \Bigl (\frac{{{\hat{f}}}^\oplus - c}{\varepsilon }\Bigr )\gamma ^\otimes . \end{aligned}$$
(5)

Furthermore, any pair of optimal dual potentials \({{\hat{f}}}, {{\hat{g}}} \in L^{\infty ,\times }({\mathbb {X}},\gamma )\) satisfies \({{\hat{f}}}^\oplus = {{\hat{g}}}^\oplus \). Moreover, if \(N-1\) of the \(\varphi _i^*\), \(i =1,\dotsc ,N\), are strictly convex, then it holds \({\hat{f}} = {\hat{g}}\).

Proof

First, we set \(V~{:}{=}~ L^{\infty ,\times }({\mathbb {X}},\gamma )\) and \(W ~{:}{=}~ L^\infty ({\mathbb {X}}, \gamma ^\otimes )\), and define \(A:V \rightarrow W\), \(F\in \Gamma _0(V)\) and \(G\in \Gamma _0(W)\) via

$$\begin{aligned}&A(f) ~{:}{=}~ f^\oplus ,\\&F(f) ~{:}{=}~ \sum _{i=1}^N \int _{{\mathbb {X}}_i} \varphi _i^*\circ f_i \,\mathrm {d}\mu _i,\\&G(f) ~{:}{=}~ \varepsilon \int _{{\mathbb {X}}} \exp \Bigl (\frac{f - c}{\varepsilon }\Bigr ) - 1 \,\mathrm {d}\gamma ^\otimes . \end{aligned}$$

Note that \(V^*\) and \(W^*\) are the respective dual spaces of finitely additive signed measures that are absolutely continuous with respect to \(\gamma \) and \(\gamma ^\otimes \), respectively. From [45, Thm. 4] with the choice \(g:{\mathbb {X}}\times {\mathbb {R}}\rightarrow {\mathbb {R}}\),

$$\begin{aligned} g(x, p)&~{:}{=}~ \varepsilon \Big ( \exp \Big ( \frac{p-c(x)}{\varepsilon } \Big )-1 \Big ) \end{aligned}$$

such that \(g^*(x, q)=c(x)q + \varepsilon (\ln (q)q-q+1)\) if \(q\ge 0\) with the convention \(0\ln 0=0\) and \(g^*(x, q)=\infty \) otherwise, we get that the Fenchel conjugate \(G^* \in \Gamma _0(W^*)\) is given by

$$\begin{aligned} G^*(\pi ) = \int _{{\mathbb {X}}} c \sigma _\pi \,\mathrm {d}\gamma ^\otimes + \varepsilon {{\,\mathrm{KL}\,}}\bigl (\sigma _\pi \gamma ^\otimes , \gamma ^\otimes \bigr ) \end{aligned}$$

if there exists nonnegative \(\sigma _\pi \in L^1({\mathbb {X}},\gamma ^\otimes )\) with \(\langle \pi , f \rangle = \int _{\mathbb {X}}f \sigma _\pi \,\mathrm {d}\gamma ^\otimes \) for all \(f \in W\) and \(G^*(\pi ) = \infty \) for all other \(\pi \in W^*\).

Using the definition of the Fenchel conjugate and (1), the function \(F^* \circ A^*\in \Gamma _0(W^*)\) can be expressed for any such \(\pi \) as

$$\begin{aligned} F^*(A^*\pi )&= \!\sup _{f\in L^{\infty ,\times }({\mathbb {X}},\gamma )}\!\langle A^* \pi , f \rangle - \!\sum _{i=1}^N \int _{{\mathbb {X}}_i} \varphi _i^*\circ f_i \,\mathrm {d}\mu _i\\&= \!\sup _{f\in L^{\infty ,\times }({\mathbb {X}},\gamma )} \langle \pi , f^\oplus \rangle - \sum _{i=1}^N \int _{{\mathbb {X}}_i} \varphi _i^*\circ f_i \,\mathrm {d}\mu _i\\&= \!\sup _{f\in L^{\infty ,\times }({\mathbb {X}},\gamma )} \sum _{i=1}^N \int _{{\mathbb {X}}_i} \!f_i \,\mathrm {d}P_{{\mathbb {X}}_{i \#}} \pi - \!\!\int _{{\mathbb {X}}_i} \!\varphi _i^*\circ f_i \,\mathrm {d}\mu _i\\&= \sum _{i=1}^N D_{\varphi _i}(P_{{\mathbb {X}}_{i\#}} \pi , \mu _i). \end{aligned}$$

Now, we obtain the assertion by applying the Fenchel–Rockafellar duality relation

$$\begin{aligned}&\inf _{w \in W^*} \bigl \{ F^*(A^* w) + G^*(w)\bigr \}\\&\quad = \sup _{x \in V}\bigl \{ - F(-x) - G(Ax)\bigr \},\nonumber \end{aligned}$$
(6)

see [23, Thm. 4.1, p. 61]. Due to the definition of \(G^*\), it suffices to consider elements from \(W^*\) that can be identified with elements in \(L^1({\mathbb {X}},\gamma ^\otimes )\). Hence, the problem (6) coincides with (2).

The second assertion follows using the optimality conditions. More precisely, let \(\hat{\pi }\) and \({\hat{f}}\) be optimal. By [23, Chap. 3, Prop. 4.1], this yields \(A {{\hat{f}}} \in \partial G^* ({\hat{\pi }})\) which is equivalent to \(\hat{\pi } \in \partial G(A{\hat{f}})\). Since G is Gâteaux-differentiable with \(\nabla G(f) = \exp \bigl (\frac{f - c}{\varepsilon }\bigr ) \gamma ^\otimes \), we obtain

$$\begin{aligned} \hat{\pi } = \exp \biggl (\frac{{\hat{f}}^\oplus - c}{\varepsilon }\biggr ) \gamma ^\otimes . \end{aligned}$$

Finally, let \({\hat{f}},{\hat{g}} \in L^{\infty ,\times }({\mathbb {X}},\gamma )\) be two optimal dual potentials. The second summand in (4) is concave in f and strictly concave in \({{\hat{f}}}^\oplus \). The first summand in (4) is concave. Moreover, if \(N-1\) of the \(\varphi _i^*\), \(i=1,\dotsc ,N\), are strictly convex, then (4) is strictly concave. Hence, both claims follow. \(\square \)

4 Sinkhorn Algorithm for Solving the Dual Problem

In this section, we derive an algorithm for solving the dual problem (3). We prove its convergence under the assumption that for all \(i = 1, \ldots , N\), we have \(\ln (\sigma _{\mu _i}) \in L^\infty ({\mathbb {X}}_i,\gamma _i)\), where \(\sigma _{\mu _i}\) is the Radon–Nikodym derivative of \(\mu _i\) with respect to the reference measures \(\gamma _i\), and some mild assumptions on the entropy functions \(\varphi _i\).

First, we introduce two operators that appear in the optimality conditions of the dual problem, namely the \((c,\varepsilon )\)-transform and the anisotropic proximity operator.

Definition 4.1

For \(j = 1,\dotsc ,N\), the j-th \((c,\varepsilon )\)-transform \({{\,\mathrm{F}\,}}^{(c,\varepsilon ,j)}:L^{\infty ,\times }({\mathbb {X}},\gamma )\rightarrow L^\infty ({\mathbb {X}}_j,\gamma _j)\) is given by

$$\begin{aligned} {{\,\mathrm{F}\,}}^{(c,\varepsilon ,j)}(f)&= f^{(c,\varepsilon ,j)} \\&~{:}{=}~ \varepsilon \ln (\sigma _{\mu _j}) - \varepsilon \ln \biggl (\int _{{\mathbb {X}}_{\setminus j}} \!\!\!\exp \biggl (\frac{f_{\setminus j}^\oplus - c}{\varepsilon }\biggr ) \,\mathrm {d}\gamma _{\setminus j}^\otimes \biggr ).\nonumber \end{aligned}$$
(7)

This transform was discussed in relation with \({{\,\mathrm{MOT}\,}}_\varepsilon \) in [20], where the following two properties were shown.
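On finite spaces, (7) becomes a weighted log-sum-exp, which should be evaluated with the usual max-shift for numerical stability. A sketch for \(N = 2\) (our own naming; the marginal index j is 0-based and the density \(\sigma _{\mu _j}\) is passed in log form):

```python
import math

def c_eps_transform(j, f_other, cost, eps, log_sigma_j, gamma_other):
    """Discrete j-th (c, eps)-transform (7) for N = 2 marginals, j in {0, 1}.
    cost[i][k] is indexed by (x_1, x_2); returns the transformed potential."""
    out = []
    for a in range(len(log_sigma_j)):
        # exponents (f_other(b) - c(a, b)) / eps, integrated against gamma_other
        expo = [(f_other[b] - (cost[a][b] if j == 0 else cost[b][a])) / eps
                for b in range(len(gamma_other))]
        m = max(expo)  # shift so that the largest exponent is zero
        total = sum(math.exp(e - m) * gamma_other[b] for b, e in enumerate(expo))
        out.append(eps * log_sigma_j[a] - eps * (m + math.log(total)))
    return out
```

For \(N > 2\), the inner sum runs over the product grid \({\mathbb {X}}_{\setminus j}\); for tree-structured costs it factorizes, which is the basis of the efficient implementation discussed later.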

Lemma 4.2

Let \(\ln (\sigma _{\mu _i}) \in L^\infty ({\mathbb {X}}_i,\gamma _i)\), \(i = 1,\dotsc ,N\), and \(f \in L^{\infty ,\times }({\mathbb {X}},\gamma )\). Then, the following holds:

  1. (i)

    For every \(j = 1,\dotsc ,N\) it holds

    $$\begin{aligned}&\Vert f^{(c,\varepsilon ,j)} \!+ \lambda _{f,j} \Vert _{\infty } \!\le \! \Vert c \Vert _\infty \!+ \varepsilon \Vert \ln (\sigma _{\mu _j}) \Vert _{\infty }, \end{aligned}$$

    where

    $$\begin{aligned} \lambda _{f,j} ~{:}{=}~ \varepsilon \ln \biggl (\int _{{\mathbb {X}}_{\setminus j}} \exp \biggl (\frac{f_{\setminus j}^\oplus }{\varepsilon }\biggr ) \,\mathrm {d}\gamma _{\setminus j}^\otimes \biggr ). \end{aligned}$$

    In particular, we get that \(f^{(c,\varepsilon ,j)}\) has bounded oscillation

    $$\begin{aligned} \sup _{y \in {\mathbb {X}}_j} f^{(c,\varepsilon ,j)}(y) - \inf _{y \in {\mathbb {X}}_j} f^{(c,\varepsilon ,j)}(y) < \infty . \end{aligned}$$
  2. (ii)

    The nonlinear and continuous operator \({{\,\mathrm{F}\,}}^{(c,\varepsilon ,j)}:L^{\infty ,\times }({\mathbb {X}},\gamma )\rightarrow L^p({\mathbb {X}}_j,\gamma _j)\) is compact for \(p \in [1, \infty )\), i.e., it maps bounded sets to relatively compact sets.

Definition 4.3

For any entropy function \(\varphi \in \Gamma _0({\mathbb {R}})\) and \(\varepsilon > 0\), the anisotropic proximity operator \(\text {aprox}_{\varphi ^*}^{\varepsilon }:{\mathbb {R}}\rightarrow {\mathbb {R}}\) is given by

$$\begin{aligned} \text {aprox}_{\varphi ^*}^{\varepsilon }(p) ~{:}{=}~ \mathop {\mathrm {argmin}}\limits _{q \in {\mathbb {R}}} \bigl \{\varepsilon e^{\frac{p-q}{\varepsilon }} + \varphi ^*(q)\bigr \}. \end{aligned}$$
(8)

Remark 4.4

This operator is indeed well defined. Furthermore, it is 1-Lipschitz and can be given in analytic form for various conjugate entropy functions, see [47].

Example 4.5

Let us have a closer look at the functions from Example 2.1.

  1. (i)

    For \(\varphi = \iota _{\{1\}}\) it holds \(\text {aprox}_{\varphi ^*}^{\varepsilon }(p) = p\).

  2. (ii)

    For \(\varphi =\iota _{[0,\infty )}\), we get \(\text {aprox}_{\varphi ^*}^{\varepsilon }(p) = 0\).

  3. (iii)

    For \(\varphi (x) = t (x \ln (x) -x+1)\) corresponding to the Kullback–Leibler divergence, we have

    $$\begin{aligned} \text {aprox}_{\varphi ^*}^{\varepsilon }(p) = \frac{t}{t+\varepsilon } \, p. \end{aligned}$$
  4. (iv)

    For \(\varphi (x) = t |x-1|\) belonging to the \({{\,\mathrm{TV}\,}}\) distance, it holds

    $$\begin{aligned} \text {aprox}_{\varphi ^*}^{\varepsilon }(p) = \left\{ \begin{array}{rl} -t,&{} p < -t,\\ p, &{} p \in [-t,t],\\ t, &{} p > t. \end{array} \right. \end{aligned}$$
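These closed forms translate directly into code. A sketch with the penalty weight t as in (iii) and (iv); the uniform signature (including eps where it does not enter) and the names are ours:

```python
def aprox_balanced(p, eps):
    """phi = iota_{1} (hard marginal constraint): aprox is the identity."""
    return p  # eps does not enter here

def aprox_kl(p, eps, t):
    """phi(x) = t (x ln x - x + 1) (KL penalty): scaling toward zero."""
    return t / (t + eps) * p

def aprox_tv(p, eps, t):
    """phi(x) = t |x - 1| (TV penalty): clipping to the interval [-t, t]."""
    return max(-t, min(t, p))  # eps does not enter here
```

As \(t \rightarrow \infty \), both \(\text {aprox}_{\varphi ^*}^{\varepsilon }\) for KL and TV approach the identity, recovering the balanced case.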

Definition 4.6

The \((c,\varepsilon )\)-transform and the anisotropic proximity operator are concatenated to the j-th Sinkhorn mapping

$$\begin{aligned} S^{(c,\varepsilon ,\varphi ,j)} :L^{\infty ,\times }({\mathbb {X}},\gamma ) \rightarrow L^\infty ({\mathbb {X}}_j,\gamma _j) \end{aligned}$$

defined as \(S^{(c,\varepsilon ,\varphi ,j)}(f) ~{:}{=}~ f^{(c,\varepsilon ,\varphi ,j)}\) with

$$\begin{aligned} f^{(c,\varepsilon ,\varphi ,j)} ~{:}{=}~ - \text {aprox}_{\varphi _j^*}^\varepsilon \bigl ( -f^{(c,\varepsilon ,j)} \bigr ), \end{aligned}$$

where the operator \(\text {aprox}_{\varphi _j^*}^\varepsilon \) is applied pointwise.

Now, we derive the maximizer of \({\mathcal {D}}_\varepsilon ^\varphi \) defined in (4).

Proposition 4.7

Let \(\ln (\sigma _{\mu _i}) \in L^\infty ({\mathbb {X}}_i,\gamma _i)\), \(i=1,\ldots ,N\).

Then, it holds for all \(j = 1,\dotsc ,N\) and \(f \in L^{\infty ,\times }({\mathbb {X}},\gamma )\) that

$$\begin{aligned}&{\mathcal {D}}_\varepsilon ^\varphi (f) \le {} {\mathcal {D}}_\varepsilon ^\varphi (\dotsc ,f_{j-1},f^{(c,\varepsilon ,\varphi ,j)},f_{j+1},\dotsc ) \end{aligned}$$

with equality if and only if \(f_j = f^{(c,\varepsilon ,\varphi ,j)}\). Furthermore, f is a maximizer of \({\mathcal {D}}_\varepsilon ^\varphi \) if and only if \(f_j = f^{(c,\varepsilon ,\varphi ,j)}\) for all \(j=1,\ldots ,N\).

[Algorithm 1 (Sinkhorn iterations for \({{\,\mathrm{UMOT}\,}}_\varepsilon \)): starting from an initialization \(f^{(0)}\), for \(n = 0, 1, \dotsc \) set \(j = (n \bmod N) + 1\) and update \(f_j^{(n+1)} = S^{(c,\varepsilon ,\varphi ,j)}(f^{(n)})\), keeping the other components fixed.]

Proof

We fix \(j \in \{1,\dotsc ,N\}\) and rewrite

$$\begin{aligned} {\mathcal {D}}_\varepsilon ^\varphi (f)&= - \sum _{i \ne j} \bigl \langle \mu _i, \varphi _i^*(-f_i) \bigr \rangle - \int _{{\mathbb {X}}_j} \biggl [ \varphi _j^*(-f_j) \sigma _{\mu _j} \\&\quad + \varepsilon \exp \Bigl (\frac{f_j}{\varepsilon }\Bigr ) \int _{{\mathbb {X}}_{\setminus j}} \exp \biggl (\frac{f_{\setminus j}^\oplus - c}{\varepsilon }\biggr ) \,\mathrm {d}\gamma _{\setminus j}^\otimes \biggr ] \,\mathrm {d}\gamma _j. \end{aligned}$$

By rearranging the definition (7) of the \((c,\varepsilon )\)-transform, we get that

$$\begin{aligned}&\displaystyle \exp \biggl (\frac{f^{(c,\varepsilon ,j)}}{-\varepsilon } \biggr ) = \frac{1}{\sigma _{\mu _j}} \int _{{\mathbb {X}}_{\setminus j}} \!\!\!\exp \biggl (\frac{f_{\setminus j}^\oplus - c}{\varepsilon }\biggr ) \,\mathrm {d}\gamma _{\setminus j}^\otimes . \end{aligned}$$

Plugging this into the equation above, we get

$$\begin{aligned} {\mathcal {D}}_\varepsilon ^\varphi (f) =&- \sum _{i \ne j} \langle \mu _i, \varphi _i^*(-f_i) \rangle \\&- \int _{{\mathbb {X}}_j} \biggl [ \varphi _j^*(-f_j) + \varepsilon \exp \Bigl (\frac{f_j - f^{(c,\varepsilon ,j)}}{\varepsilon }\Bigr ) \biggr ] \,\mathrm {d}\mu _j. \end{aligned}$$

The integrand on the right-hand side has the form of the functional in (8), so that we obtain

$$\begin{aligned}&{\mathcal {D}}_\varepsilon ^\varphi (f) \le {\mathcal {D}}_\varepsilon ^\varphi (\dotsc ,f_{j-1},f^{(c,\varepsilon ,\varphi ,j)},f_{j+1},\dotsc ). \end{aligned}$$

Since the minimization problem (8) admits a unique solution, strict inequality holds if and only if \(f_j \ne f^{(c,\varepsilon ,\varphi ,j)}\).

By the first part of the proof, the relation \(f_j = f^{(c,\varepsilon ,\varphi ,j)}\) is equivalent to \(0 \in \partial _j {\mathcal {D}}_\varepsilon ^\varphi (f)\), and the last statement follows if we show that

$$\begin{aligned} \partial {\mathcal {D}}_\varepsilon ^\varphi (f) = \partial _1 {\mathcal {D}}_\varepsilon ^\varphi (f) \times \cdots \times \partial _N {\mathcal {D}}_\varepsilon ^\varphi (f). \end{aligned}$$
Using \(F_i:L^\infty ({\mathbb {X}}_i, \gamma _i) \rightarrow [-\infty ,\infty )\) as well as \(G:L^{\infty ,\times }({\mathbb {X}},\gamma ) \rightarrow {\mathbb {R}}\) given by

$$\begin{aligned}&F_i(f) = -\int _{{\mathbb {X}}_i} \varphi _i^* \circ (-f) \,\mathrm {d}\mu _i, \quad i = 1,\dotsc ,N,\\&G(f) = -\varepsilon \int _{{\mathbb {X}}} \exp \Bigl (\frac{f - c}{\varepsilon }\Bigr ) \,\mathrm {d}\gamma ^\otimes , \end{aligned}$$

respectively, we can decompose \({\mathcal {D}}_\varepsilon ^\varphi \) as \({\mathcal {D}}_\varepsilon ^\varphi = F^\oplus + G\), where \(F^\oplus (f) ~{:}{=}~\sum _{i=1}^N F_i(f_i)\). As G is Gâteaux-differentiable, we obtain that \(\partial G(f) = \{\nabla G(f)\} = \partial _1 G(f) \times \cdots \times \partial _N G(f)\). By continuity of G and since \(0\in {{\,\mathrm{dom}\,}}\varphi ^*\), it holds \(0\in {{\,\mathrm{dom}\,}}G \cap {{\,\mathrm{dom}\,}}F^\oplus \), such that the subdifferentials are additive by [23, Ch. 1, Prop. 5.6]. Thus, using \(\partial F^\oplus (f) = \partial F_1(f_1) \times \cdots \times \partial F_N(f_N)\), we obtain

$$\begin{aligned} \partial {\mathcal {D}}_\varepsilon ^\varphi (f) = \partial F^\oplus (f) + \partial G(f) = \partial _1 {\mathcal {D}}_\varepsilon ^\varphi (f) \times \cdots \times \partial _N {\mathcal {D}}_\varepsilon ^\varphi (f). \end{aligned}$$
This concludes the proof. \(\square \)

Inspired by the Sinkhorn iterations for \(\mathrm {MOT}_\varepsilon \) in [20] and \(\mathrm {UOT}_\varepsilon \) in [47], we propose Algorithm 1 for solving \(\mathrm {UMOT}_\varepsilon \) in its dual form (3). By Proposition 4.7, every fixed point of the sequence \((f^{(rN)})_{r \in {\mathbb {N}}}\) generated by Algorithm 1 is a solution of (3).
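For two finite marginals with KL penalties of weight t, the iterations of Algorithm 1 can be sketched as follows, combining the \((c,\varepsilon ,j)\)-transform (7), the aprox step from Example 4.5(iii), and the plan recovery (5). The discretization and all names are our own illustrative assumptions, and for clarity the inner integral is evaluated without log-domain stabilization:

```python
import math

def sinkhorn_umot2(mu1, mu2, cost, gamma1, gamma2, eps, t, iters=100):
    """Sinkhorn iterations for the dual UMOT problem with N = 2 discrete
    marginals and KL penalties of weight t. Each coordinate update applies
    the (c, eps, j)-transform followed by -aprox(-.) for the KL penalty."""
    f = [[0.0] * len(mu1), [0.0] * len(mu2)]           # dual potentials, init 0
    sig = [[m / g for m, g in zip(mu1, gamma1)],        # densities d(mu_j)/d(gamma_j)
           [m / g for m, g in zip(mu2, gamma2)]]
    gam = [gamma1, gamma2]
    for _ in range(iters):
        for j in (0, 1):                                # cyclic coordinate updates
            k = 1 - j
            new = []
            for a in range(len(f[j])):
                # integral over the other space against the reference measure
                total = sum(
                    math.exp((f[k][b] - (cost[a][b] if j == 0 else cost[b][a])) / eps)
                    * gam[k][b]
                    for b in range(len(f[k]))
                )
                ct = eps * math.log(sig[j][a]) - eps * math.log(total)  # (7)
                new.append(t / (t + eps) * ct)          # aprox of Example 4.5(iii)
            f[j] = new
    # recover the optimal plan via the primal-dual relation (5)
    plan = [[math.exp((f[0][a] + f[1][b] - cost[a][b]) / eps) * gamma1[a] * gamma2[b]
             for b in range(len(mu2))] for a in range(len(mu1))]
    return f, plan
```

In practice, the inner sums should be computed with a log-sum-exp shift, and for \(N > 2\) with tree-structured costs they factorize along the tree, which preserves the scalability discussed in the introduction.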

Remark 4.8

Since the entropy functions are nonnegative with \(\varphi _i(1) = 0\), it holds \(\varphi _i^*(0) = -\inf \varphi _i = 0\). Hence, we can choose \(f^{(0)}=0\) as an initialization with \({\mathcal {D}}_\varepsilon ^\varphi (f^{(0)}) > -\infty \).

Next, we want to show that the sequence converges. Note that in [20, Thm. 4.7] convergence of the (rescaled) Sinkhorn algorithm was shown by exploiting the property \({\mathcal {D}}_\varepsilon ^\varphi (f_1,\dotsc ,f_N) = {\mathcal {D}}_\varepsilon ^\varphi (f_1 + \lambda _1,\dotsc ,f_N + \lambda _N)\) for all \(\lambda _1,\dotsc ,\lambda _N \in {\mathbb {R}}\) with \(\sum _{i=1}^N \lambda _i = 0\), which holds exclusively in the balanced case where \(\varphi _i^*(q) = q\). Hence, significant modifications of the proof are necessary. Albeit taking several ideas from [47], our approach differs as we cannot rely on the 1-Lipschitz continuity of the \((c,\varepsilon )\)-transform, which only holds for \(N=2\). Instead, we exploit the compactness of the Sinkhorn operator as in [20], for which we need to establish uniform boundedness of the iterates. To this end, we need the following two lemmata.

Lemma 4.9

Let \(f^{(n)} \in L^{\infty , \times } ({\mathbb {X}}, \gamma )\), \(n \in {\mathbb {N}}\), satisfy \(\Vert (f^{(n)})^\oplus \Vert _{\infty } \overset{n\rightarrow \infty }{\longrightarrow } \infty \) and have uniformly bounded oscillations (see Lemma 4.2). Then, it holds \({\mathcal {D}}_\varepsilon ^\varphi (f^{(n)}) \rightarrow -\infty \).

Proof

Since the entropy functions \(\varphi _i\), \(i=1,\ldots ,N\), satisfy \(\varphi _i(1) = 0\), we have \(\varphi _i^*(x) = \sup _{p \in {\mathbb {R}}} \bigl ( px - \varphi _i(p) \bigr ) \ge x \cdot 1 - \varphi _i(1) = x\) for all \(x \in {\mathbb {R}}\). Hence, we can estimate

$$\begin{aligned} - \sum _{i=1}^N \bigl \langle \mu _i, \varphi _i^*(-f^{(n)}_i) \bigr \rangle&\le \langle \mu ^\otimes , (f^{(n)})^\oplus \rangle . \end{aligned}$$

Since the \(\mu _i\) are absolutely continuous with respect to \(\gamma _i\) with density \(\sigma _{\mu _i}\), we obtain

$$\begin{aligned} {\mathcal {D}}_\varepsilon ^\varphi (f^{(n)})&\le \biggl \langle \mu ^\otimes , (f^{(n)})^\oplus \\&\quad -\varepsilon \exp \biggl (\frac{ (f^{(n)})^\oplus - c - \varepsilon \ln \bigl (\prod _{i=1}^N \sigma _{\mu _i}\bigr )}{\varepsilon }\biggr )\biggr \rangle . \end{aligned}$$

Clearly, \((f^{(n)})^\oplus \) has uniformly bounded oscillation. Hence, for \(\Vert (f^{(n)})^\oplus \Vert _{\infty } \rightarrow \infty \) the integrand diverges to \(-\infty \) on a set of positive measure, which yields the assertion. \(\square \)

Lemma 4.10

Let \(\ln (\sigma _{\mu _i}) \in L^\infty ({\mathbb {X}}_i,\gamma _i)\), \(i = 1,\dotsc ,N\), and \({\mathcal {D}}_\varepsilon ^\varphi (f^{(0)}) > -\infty \). For the Sinkhorn sequence \((f^{(n)})_{n\in {\mathbb {N}}} \subset L^{\infty ,\times }({\mathbb {X}},\gamma )\) generated by Algorithm 1, there exists a constant \(M>0\) and a sequence \((\lambda ^{(n)})_{n\in {\mathbb {N}}}\), \(\lambda ^{(n)} \in {\mathbb {R}}^N\), with \(\sum _{i=1}^N \lambda _i^{(n)} = 0\) such that

$$\begin{aligned} \bigl \Vert f_i^{(n)} + \lambda _i^{(n)} \bigr \Vert _{ \infty } < M. \end{aligned}$$

Proof

For \(i = 1,\dotsc ,N-1\), \(j = 0,\dotsc ,N-1\) and \(r \in {\mathbb {N}}\) set

$$\begin{aligned}&\lambda ^{(rN + j)}_i ~{:}{=}~ {\left\{ \begin{array}{ll} \text {aprox}^\varepsilon _{\varphi ^*_i}(\lambda _{f^{(rN+i-1)},i}) &{} \!\!\!\!\!\text {if } i \le j,\\ \text {aprox}^\varepsilon _{\varphi ^*_i}(\lambda _{f^{((r-1)N+i-1)},i})&{} \!\!\!\!\!\text {if } i > j , \end{array}\right. }\\&\lambda _N^{(rN + j)} ~{:}{=}~ - \sum _{i=1}^{N-1} \lambda _i^{(rN + j)}, \end{aligned}$$

where \(\lambda _{f,i}\) is defined as in Lemma 4.2. Since \(\text {aprox}\) is 1-Lipschitz, by Lemma 4.2 i) and the definition of the iterates, for \(i \le j\) we obtain that

$$\begin{aligned}&\bigl \Vert f^{(rN+j)}_i + \lambda _i^{(rN+j)} \bigr \Vert _{\infty }\\&\quad =\bigl \Vert \bigl ( f^{(rN+i-1)} \bigr )^{(c,\varepsilon ,\varphi ,i)} +\lambda _i^{(rN+j)} \bigr \Vert _{ \infty } \\&\quad =\bigl \Vert -\text {aprox}^\varepsilon _{\varphi ^*_i}\bigl (-(f^{(rN+i-1)} )^{(c,\varepsilon ,i)}\bigr ) \\&\qquad +\text {aprox}^\varepsilon _{\varphi ^*_i}(\lambda _{f^{(rN+i-1)},i}) \bigr \Vert _{ \infty } \\&\quad \le \bigl \Vert ( f^{(rN+i-1)} )^{(c,\varepsilon ,i)} + \lambda _{f^{(rN+i-1)},i} \bigr \Vert _{\infty } \\&\quad \le \Vert c \Vert _{\infty } + \sup _i \varepsilon \Vert \ln (\sigma _{\mu _i}) \Vert _{ \infty } \le M_1 \end{aligned}$$

for some \(M_1<\infty \). In the case \(N>i>j\), we have

$$\begin{aligned} f^{(rN+j)}_i=\bigl ( f^{((r-1)N+i-1)} \bigr )^{(c,\varepsilon ,\varphi ,i)}, \end{aligned}$$

such that this case works similarly. Thus, it remains to estimate the last component. By Proposition 4.7, it holds for \(n \in {\mathbb {N}}\) that \({\mathcal {D}}_\varepsilon ^\varphi (f^{(0)}) \le {\mathcal {D}}_\varepsilon ^\varphi (f^{(n)})\). Due to Lemma 4.9, this ensures the existence of \(M_2 > 0\) such that \(\Vert \bigoplus _{i=1}^N f^{(n)}_i \Vert _{\infty } \le M_2\) for all \(n \in {\mathbb {N}}\). Thus,

$$\begin{aligned}&\bigl \Vert f^{(rN+j)}_N + \lambda ^{(rN+j)}_N \bigr \Vert _{\infty }\\&\quad \le \Bigl \Vert \bigoplus _{i=1}^{N} f^{(rN+j)}_i \Bigr \Vert _{\infty }\\&\qquad + \Bigl \Vert \lambda ^{(rN+j)}_N - \bigoplus _{i=1}^{N-1} f^{(rN+j)}_i \Bigr \Vert _{\infty }\\&\quad \le M_2 + \sum _{i=1}^{N-1} \bigl \Vert f^{(rN+j)}_i + \lambda _i^{(rN+j)} \bigr \Vert _{\infty }\\&\quad \le M_2 + (N-1)M_1. \end{aligned}$$

For \(M ~{:}{=}~ M_2 + (N-1)M_1\) the assertion follows. \(\square \)

Now, we can prove convergence of the Sinkhorn iterates under mild additional assumptions on the entropy functions.

Theorem 4.11

Let \(\ln (\sigma _{\mu _i}) \in L^\infty ({\mathbb {X}}_i,\gamma _i)\), \(i = 1,\dotsc ,N\). Assume that \([0, \infty )\subset {{\,\mathrm{dom}\,}}\varphi _i\) for all \(i=1, \dots , N-1\), and that \({\mathcal {D}}_\varepsilon ^\varphi (f^{(0)}) > -\infty \). Then, for every \(p \in [1,\infty )\), the sequence \((f^{(n)})_{n \in {\mathbb {N}}}\) induced by Algorithm 1 satisfies \(\Vert (f^{(n)})^\oplus - {\hat{f}}^\oplus \Vert _p \xrightarrow []{n \rightarrow \infty } 0\), where \({{\hat{f}}} \in L^{\infty ,\times }({\mathbb {X}},\gamma )\) is any optimal solution of (3). Moreover, if \(N-1\) of the \(\varphi _i^*\) are strictly convex, then the optimal \({\hat{f}}\) is unique and \(\Vert f^{(n)} - {\hat{f}} \Vert _p \rightarrow 0\) for every \(p \in [1,\infty )\).

Proof

First, we show that the sequence \((f^{(n)})_{n \in {\mathbb {N}}}\) is uniformly bounded. By Lemma 4.10, there exists a sequence \((\lambda ^{(n)})_{n\in {\mathbb {N}}} \subset {\mathbb {R}}^N\), with \(\sum _{i=1}^N \lambda _i^{(n)} = 0\) and \(M > 0\) such that

$$\begin{aligned} \bigl \Vert f_i^{(n)} + \lambda _i^{(n)} \bigr \Vert _{\infty } \le M \end{aligned}$$
(9)

for all \(i=1,\ldots ,N\), \(n\in {\mathbb {N}}\). Define \(g_i^{(n)} ~{:}{=}~ f_i^{(n)} + \lambda _i^{(n)}\). To obtain uniform boundedness, it suffices to show that \(\max _i |\lambda _i^{(n)}|\) is uniformly bounded in n. We have for any \(q_i \in {{\,\mathrm{dom}\,}}{\varphi _i^*}\) and \(p_i \in \partial \varphi _i^*(q_i)\) by the first-order convexity condition and since elements of \(\partial \varphi _i^*(q_i)\) are nonnegative, that

$$\begin{aligned}&\bigl \langle \mu _i, - \varphi _i^*(-f^{(n)}_i)\bigr \rangle \\&\quad = \bigl \langle \mu _i, - \varphi _i^*(-g_i^{(n)} + \lambda _i^{(n)})\bigr \rangle \\&\quad \le \Bigl \langle \mu _i, -\varphi _i^*(q_i) + p_i \bigl (g_i^{(n)} - \lambda _i^{(n)} + q_i\bigr )\Bigr \rangle \\&\quad \le \Bigl \langle \mu _i, - \varphi _i^*(q_i) + p_i \bigl (\bigl \Vert g_i^{(n)} \bigr \Vert _{\infty } - \lambda _i^{(n)} + q_i\bigr ) \Bigr \rangle . \end{aligned}$$

Consequently, we obtain by (9) that

$$\begin{aligned}&\sum _{i=1}^N \bigl \langle \mu _i, - \varphi _i^*(-f_i^{(n)})\bigr \rangle \le -\sum _{i=1}^N \mu _i ({\mathbb {X}}_i) p_i \, \lambda _i^{(n)}\\&\quad +\underbrace{\sum _{i=1}^N \mu _i({\mathbb {X}}_i) \bigl (p_i (M + q_i)-\varphi _i^*(q_i)\bigr )}_{=:K(p_i, q_i)}. \end{aligned}$$

Setting \(m_i ~{:}{=}~ \mu _i ({\mathbb {X}}_i)\), \(i=1,\ldots ,N\), and using that the \(\lambda ^{(n)}_i\) sum up to zero, we conclude

$$\begin{aligned}&\sum _{i=1}^N \bigl \langle \mu _i, - \varphi _i^*(-f_i^{(n)})\bigr \rangle \nonumber \\&\quad \le \sum _{i=1}^{N-1} \bigl (m_N p_N- m_i p_i) \lambda ^{(n)}_i + K(p_i, q_i).\quad \end{aligned}$$
(10)

First, since \({{\,\mathrm{dom}\,}}\varphi _N^* =(-\infty , (\varphi _N)'_\infty )\) with \((\varphi _N)'_\infty \ge 0\), we can fix some \(q_N\in \mathrm {int}({{\,\mathrm{dom}\,}}\varphi _N^*)\) and some \(p_N\in \partial \varphi _N^*(q_N)\ne \emptyset \). Assume that there exists at least one \(i \in \{1,\ldots ,N\}\) such that \({(\lambda _i^{(n)})_{n \in {\mathbb {N}}}}\) is unbounded. Since the \({\lambda _i^{(n)}}\) sum to zero, there then also exists at least one such \(i \in \{1,\ldots ,N-1\}\). For each of these i, we can extract a subsequence \((n_{k})_k\) such that either \(\lambda _i^{(n_k)} \rightarrow -\infty \) or \(\lambda _i^{(n_k)} \rightarrow \infty \). In the first case, choose some \(p_i > 0\) small enough such that \(m_N p_N > m_i p_i\). By assumption, \(p_i\in \mathrm {int}({{\,\mathrm{dom}\,}}\varphi _i)\), such that there exists \(q_i\in \partial \varphi _i(p_i)\). Since \(\varphi _i\in \Gamma _0({\mathbb {R}})\), it follows that \(p_i\in \partial \varphi _i^*(q_i)\). Moreover, since

$$\begin{aligned} \infty > p_i q_i = \varphi _i(p_i) + \varphi _i^*(q_i) \end{aligned}$$

and \(p_i\in {{\,\mathrm{dom}\,}}\varphi _i\), we also have \(q_i\in {{\,\mathrm{dom}\,}}(\varphi _i^*)\). Similarly, if \(\lambda _i^{(n_k)} \rightarrow \infty \), choose some \(p_i<\infty \) such that \(m_N p_N < m_i p_i\) and \(q_i\in {{\,\mathrm{dom}\,}}\varphi _i^*\) with \(p_i\in \partial \varphi _i^*(q_i)\). Now, we have by Proposition 4.7 that \({\mathcal {D}}_\varepsilon ^\varphi (f^{(0)}) \le {\mathcal {D}}_\varepsilon ^\varphi (f^{(n_{k})})\). Since \((f^{(n_k)})^\oplus = (g^{(n_k)})^\oplus \) and the \(g_i^{(n_k)}\), \(i=1,\ldots ,N\), are uniformly bounded, the second summand in \({\mathcal {D}}_\varepsilon ^\varphi (f^{(n_k)})\) remains bounded as \(k\rightarrow \infty \), while the first summand in \({\mathcal {D}}_\varepsilon ^\varphi (f^{(n_k)})\) goes to \(-\infty \) by (10) with the above chosen \((q_i, p_i)\). This is a contradiction to \({\mathcal {D}}_\varepsilon ^\varphi (f^{(0)}) > -\infty \). Thus, there is \({\tilde{M}}>0\) such that for \(i \in \{1,\dotsc ,N\}\) and \(n \in {\mathbb {N}}\) it holds

$$\begin{aligned} \bigl \Vert f_i^{(n)}\bigr \Vert _{\infty }&\le \bigl \Vert f_i^{(n)} + \lambda _i^{(n)}\bigr \Vert _{\infty } + \bigl \vert \lambda _i^{(n)} \bigr \vert \\&\le M + {\tilde{M}}. \end{aligned}$$

Hence, \((f^{(n)})_n\) is a uniformly bounded sequence. By Lemma 4.2 ii), we know that the operator \(S^{(c,\varepsilon ,\varphi ,j)} :L^{\infty ,\times }({\mathbb {X}},\gamma )\rightarrow L^p({\mathbb {X}}_j,\gamma _j)\) is compact for every j and \(p \in [1,\infty )\). Consequently, we get existence of a converging subsequence \((f^{(n_k)})_{k \in {\mathbb {N}}}\) in \(L^{p,\times }({\mathbb {X}},\gamma )\). As \(f^{(n_k)}\) is uniformly bounded in \(L^{\infty ,\times }({\mathbb {X}},\gamma )\) and since \(\infty \)-norm balls are closed under \(L^p\) convergence, we get that the limit \({{\hat{f}}}\) additionally satisfies \({{\hat{f}}} \in L^{\infty ,\times }({\mathbb {X}},\gamma )\).

Now, we prove optimality of \({\hat{f}}\). Note that there is \(j \in \{0,\dotsc ,N-1\}\) so that \(n_k \equiv j \mod N\) for infinitely many \(k \in {\mathbb {N}}\). Without loss of generality, assume \(j = 0\). Then, we restrict \((n_k)_k\) to \(n_k \equiv 0 \mod N\) for all \(k \in {\mathbb {N}}\). Using the Lipschitz continuity of \(\ln \) and \(\exp \) on compact sets and the 1-Lipschitz continuity of \(\text {aprox}\) together with the uniform boundedness of the sequence \((f^{(n)})_{n \in {\mathbb {N}}}\), we obtain that

$$\begin{aligned}&\bigl \Vert \bigl (f^{(n_k)}\bigr )^{(c,\varepsilon ,\varphi ,j)} - {\hat{f}}^{(c,\varepsilon ,\varphi ,j)} \bigr \Vert ^p_{p} \\&\quad \le \bigl \Vert \bigl (f^{(n_k)}\bigr )^{(c,\varepsilon ,j)} - {\hat{f}}^{(c,\varepsilon ,j)} \bigr \Vert ^p_{p}\\&\quad \le C \, \biggl \Vert \int _{{\mathbb {X}}_{\setminus j}} \exp \biggl (\frac{(f^{(n_k)})_{\setminus j}^\oplus - c}{\varepsilon }\biggr ) \\&\qquad \,\,\,- \exp \biggl (\frac{{{{\hat{f}}}}_{\setminus j}^\oplus - c}{\varepsilon }\biggr ) \,\mathrm {d}\gamma _{\setminus j}^\otimes \biggr \Vert ^p_{p} \\&\quad \le C \int _{{\mathbb {X}}} \bigl \vert (f^{(n_k)})_{\setminus j}^\oplus - {{{\hat{f}}}}_{\setminus j}^\oplus \bigr \vert ^p \,\mathrm {d}\gamma ^\otimes \\&\quad \le C \bigl \Vert f^{(n_k)} - {{\hat{f}}} \bigr \Vert ^p_{p}\rightarrow 0 \end{aligned}$$

for every \(j=1,\ldots ,N\), where C stands for some unspecified, possibly changing constant. In particular, it holds

$$\begin{aligned} f_1^{(n_k+1)} \rightarrow {\hat{f}}^{(c,\varepsilon ,\varphi ,1)} \in L^p({\mathbb {X}}_1, \gamma _1). \end{aligned}$$

As all \(\varphi _i^*\) are lsc, the dominated convergence theorem implies that \({\mathcal {D}}_\varepsilon ^\varphi (f^{(n_j)}) \rightarrow {\mathcal {D}}_\varepsilon ^\varphi ({{\tilde{f}}})\) for any a.e. convergent subsequence \((f^{(n_j)})_{j \in {\mathbb {N}}}\) of \((f^{(n)})_{n \in {\mathbb {N}}}\) with limit \({{\tilde{f}}}\). Due to this continuity property and since \(({\mathcal {D}}_\varepsilon ^\varphi (f^{(n)}))_n\) is a convergent sequence, we get

$$\begin{aligned}&{\mathcal {D}}_\varepsilon ^\varphi \bigl ({\hat{f}}^{(c,\varepsilon ,\varphi ,1)},{\hat{f}}_2,\dotsc ,{\hat{f}}_N\bigr )\\&\quad = \lim _{k \rightarrow \infty } {\mathcal {D}}_\varepsilon ^\varphi \bigl (f^{(n_k+1)}\bigr ) = \lim _{k \rightarrow \infty } {\mathcal {D}}_\varepsilon ^\varphi \bigl (f^{(n_k)}\bigr )\\&\quad ={\mathcal {D}}_\varepsilon ^\varphi ({\hat{f}}). \end{aligned}$$

Consequently, Proposition 4.7 implies \({\hat{f}}_1 = {\hat{f}}^{(c,\varepsilon ,\varphi ,1)}\). In the same way, we obtain

$$\begin{aligned} {\mathcal {D}}_\varepsilon ^\varphi \bigl ({\hat{f}}^{(c,\varepsilon ,\varphi ,1)},{\hat{f}}^{(c,\varepsilon ,\varphi ,2)},{\hat{f}}_3,\dotsc ,{\hat{f}}_N\bigr ) = {\mathcal {D}}_\varepsilon ^\varphi ({\hat{f}}). \end{aligned}$$

Due to \({\hat{f}}_1 = {\hat{f}}^{(c,\varepsilon ,\varphi ,1)}\), this gives \({\hat{f}}_2 = {\hat{f}}^{(c,\varepsilon ,\varphi ,2)}\). Proceeding iteratively this way, we obtain that \({\hat{f}}_i = {\hat{f}}^{(c,\varepsilon ,\varphi ,i)}\) for all \(i=1, \dots , N\). Hence, Proposition 4.7 implies that \({\hat{f}}\) is an optimal dual vector.

If \(N-1\) of the \(\varphi _i^*\) are strictly convex, then the maximizer in (3) is unique and \((f^{(n)})_{n \in {\mathbb {N}}}\) converges. Otherwise, we obtain convergence of \(((f^{(n_j)})^\oplus )_{j \in {\mathbb {N}}}\) to \({\hat{f}}^\oplus \) in \(L^p({\mathbb {X}},\gamma ^\otimes )\). Since \({\hat{f}}^\oplus \) is the same for all possible limit points \({\hat{f}}\), convergence of \(((f^{(n)})^\oplus )_{n \in {\mathbb {N}}}\) follows. \(\square \)

Remark 4.12

(Relation to UOT in [47]) For entropy functions satisfying the assumptions of Theorem 4.11, our result can be seen as a generalization of the UOT result in Séjourné et al. [47] to the multi-marginal case and non-Lipschitz costs. Notably, the result by Séjourné et al. covers some additional entropy functions.

Finally, we want to remark that all results of this section also hold true if we do not assume that the spaces \({\mathbb {X}}_i\), \(i=1,\ldots ,N\), are compact as long as the cost function c remains bounded.

5 Barycenters and Tree-Structured Costs

In this section, we are interested in the computation of \(\mathrm {UOT}_\varepsilon \) barycenters and their relation to \(\mathrm {UMOT}_\varepsilon \) with tree-structured costs. An undirected graph \({{\,\mathrm{{\mathcal {G}}}\,}}= ({{\,\mathrm{{\mathcal {V}}}\,}},{{\,\mathrm{{\mathcal {E}}}\,}})\) with N nodes \({{\,\mathrm{{\mathcal {V}}}\,}}= \{1,\ldots ,N\}\) and edges \({{\,\mathrm{{\mathcal {E}}}\,}}\) is a tree if it is acyclic and connected. We write \(e = (j,k)\) if e joins the nodes j and k, where we agree that \(j < k\) in order to count edges only once. Let \(\deg (i)\) denote the number of edges incident to node \(i \in {\mathcal {V}}\). A node \(i \in {\mathcal {V}}\) is called a leaf if \(\deg (i) = 1\). By \({\mathcal {N}}_j\), we denote the set of neighbors of node j. For a given tree, let \(t = (t_e)_{e \in {{\,\mathrm{{\mathcal {E}}}\,}}}\) with \(t_{e} \in [0,1]\) and \(\sum _{e \in {{\,\mathrm{{\mathcal {E}}}\,}}} t_e = 1\). A cost function \(c_t\) is said to be tree-structured if it is of the form

$$\begin{aligned} c_t = \sum _{e \in {{\,\mathrm{{\mathcal {E}}}\,}}} t_e c_e = \sum _{(j,k) \in {{\,\mathrm{{\mathcal {E}}}\,}}} t_{(j,k)} c_{(j,k)}. \end{aligned}$$
(11)

In Sect. 5.1, we consider the case where the tree is star-shaped, i.e., \({{\,\mathrm{{\mathcal {E}}}\,}}=\{ (1, N), \dots , (N-1, N) \}\), see Fig. 1 left. General tree-structured costs as, for example, those in Fig. 1 middle and right are addressed in Sect. 5.2. We restrict our attention to \({\mathbb {X}}_i\), \(i=1,\ldots ,N\), which are either finite or compact subsets of \({\mathbb {R}}^{d_i}\). Moreover, all reference measures \(\gamma _i\), \(i=1,\ldots ,N\), are counting measures, respectively, Lebesgue measures, so that we regularize exclusively with the entropy from Rem. 3.2 ii) or iii), respectively.

Fig. 1
figure 1

Examples of trees: star-shaped (left), H-shaped (middle), line-shaped (right)

5.1 Barycenters

For the barycenter problem with respect to \(\mathrm {UOT}_\varepsilon \), we introduce an additional finite set \({\mathbb {Y}}\) (or compact set \({\mathbb {Y}}\subset {\mathbb {R}}^{d_{N+1}}\)) and use for \(\xi \in {\mathcal {M}}^+({\mathbb {Y}})\) the entropy function

$$\begin{aligned} \psi ~{:}{=}~ \iota _{\{1\}}, \quad D_\psi (\cdot , \xi ) = \iota _{\{\xi \}}, \end{aligned}$$
(12)

from Example 2.1 i). Let \(c_i \in C({\mathbb {X}}_i \times {\mathbb {Y}})\), \(i=1,\ldots ,N\), be nonnegative cost functions. The corresponding tree is star-shaped, i.e., given by \({{\,\mathrm{{\mathcal {V}}}\,}}=\{ 1, \dots , N+1\}\) and \({{\,\mathrm{{\mathcal {E}}}\,}}=\{ (1, N+1), \dots , (N,N+1) \}.\) To emphasize the dependence of \({{\,\mathrm{UOT}\,}}_\varepsilon \) on these functions, we write \(\mathrm {UOT}_\varepsilon ^{(c_i,\varphi _i,\psi )}\). We use an analogous notation for \({{\,\mathrm{UMOT}\,}}_\varepsilon \). Let

$$\begin{aligned} \Delta _N ~{:}{=}~ \Bigl \{t = (t_i)_{i=1}^N: t_i \ge 0, \, \sum _{i=1}^N t_i = 1\Bigr \} \end{aligned}$$

be the \((N-1)\)-dimensional probability simplex. For given barycentric coordinates \(t \in \Delta _N\), the barycenter \(\hat{\xi } \in {\mathcal {M}}^+({\mathbb {Y}})\) of \(\mu \) with respect to \(\mathrm {UOT}_{\varepsilon }\) is given by

$$\begin{aligned} {\hat{\xi }}&~{:}{=}~ \mathop {\mathrm {argmin}}\limits _{\xi \in {\mathcal {M}}^+({\mathbb {Y}})} \sum _{i=1}^N t_i {{\,\mathrm{UOT}\,}}_{\varepsilon /t_i}^{(c_i,\varphi _i,\psi )}(\mu _i,\xi ),\\&= \mathop {\mathrm {argmin}}\limits _{\xi \in {\mathcal {M}}^+({\mathbb {Y}})} \sum _{i = 1}^N \min _{\pi ^{(i)} \in {\mathcal {M}}^+({\mathbb {X}}_i \times {\mathbb {Y}})} \int _{{\mathbb {X}}_i \times {\mathbb {Y}}}\!\! t_i c_i \,\mathrm {d}\pi ^{(i)}\nonumber \\&\quad + \varepsilon E \bigl (\pi ^{(i)}\bigr ) + t_i D_{\varphi _i}\bigl (P_{{\mathbb {X}}_{i\#}}\pi ^{(i)} , \mu _i\bigr ), \nonumber \\&\quad \; \mathrm {subject \; to} \quad P_{{\mathbb {Y}}\#}\pi ^{(i)} = \xi , \; i = 1,\ldots ,N. \nonumber \end{aligned}$$
(13)

Note that by the choice of \(\psi \) the barycentric marginal \(\xi \) is exactly matched. By Proposition 3.3, the involved \({{\,\mathrm{UOT}\,}}\) problems have unique solutions. Moreover, it was shown in [12, Sec. 5.2] that a unique barycenter exists due to the regularization. However, these barycenters do not correspond to a shortest path since \({{\,\mathrm{UOT}\,}}_\varepsilon \) is not a metric.

To establish a relation with the multi-marginal setting, we exploit that the optimal plan of the multi-marginal problem with cost function

$$\begin{aligned} c_t ~{:}{=}~ \sum _{i=1}^N t_i c_i \in C({\mathbb {X}}\times {\mathbb {Y}}) \end{aligned}$$
(14)

is readily determined by its marginals. We need the following auxiliary lemma.

Lemma 5.1

Let \(\pi \in {\mathcal {M}}^+({\mathbb {X}}\times {\mathbb {Y}})\) be absolutely continuous with respect to the Lebesgue measure, respectively, counting measure, on \({\mathbb {X}}\times {\mathbb {Y}}\) with density \(\sigma _\pi \). If there exist \(a_i \in L^\infty ({\mathbb {X}}_i \times {\mathbb {Y}})\), \(i = 1,\ldots ,N\), such that \(\sigma _{\pi } = \prod _{i=1}^N a_i\), then \(\pi \) is related to its marginals \(\pi _i ~{:}{=}~ {P_{{\mathbb {X}}_i \times {\mathbb {Y}}}}_\# \pi \), \(i = 1,\ldots ,N\), and \(\pi _{N+1} ~{:}{=}~ {P_{{\mathbb {Y}}}}_\# \pi \) via

$$\begin{aligned} \sigma _{\pi } = \frac{\sigma _{\pi _{1}} \cdots \sigma _{\pi _{ N}}}{{\sigma ^{N-1}_{\pi _{N+1}}}}, \end{aligned}$$

where \(\sigma _{\pi _i}\) denotes the density of \(\pi _i\) with respect to the Lebesgue, respectively, counting measure, on \({\mathbb {X}}_i \times {\mathbb {Y}}\).

Proof

By abuse of notation, we denote the Lebesgue measure, respectively, counting measure, by \(\lambda \). The underlying space becomes clear from the context. By assumption, the marginal densities of \(\pi \) read

$$\begin{aligned} \sigma _{\pi _j}(x_j,y) = a_j(x_j,y) \prod _{i \ne j} \int _{{\mathbb {X}}_i} a_i(x_i, y) \,\mathrm {d}\lambda (x_i), \end{aligned}$$

\(j = 1,\dotsc ,N\). Since \(P_{{\mathbb {Y}}\#} \pi = \pi _{N+1}\), we have \(\sigma _{\pi _{N+1}} = \prod _{i = 1}^N \int _{{\mathbb {X}}_i} a_i (x_i, \cdot ) \,\mathrm {d}\lambda (x_i)\), which finally yields

$$\begin{aligned} \sigma _{\pi }&= \prod _{i=1}^N a_i\\&=\frac{ \sigma _{\pi _{1} } \ldots \sigma _{\pi _{N}} }{\prod _{j=1}^N \Bigl (\prod _{i \ne j} \int _{{\mathbb {X}}_i} a_i (x_i, \cdot )\,\mathrm {d}\lambda _i(x_i)\Bigr ) }\\&= \frac{ \sigma _{\pi _{1}} \ldots \sigma _{\pi _{N}} }{\sigma ^{N-1}_{\pi _{N+1}}}. \square \end{aligned}$$
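The factorization of Lemma 5.1 can be checked numerically on a small discrete example. The following sketch uses hypothetical grid sizes and random positive factors \(a_i\) (our own choices, purely for illustration) with \(N = 3\); here the counting measure plays the role of \(\lambda \), so marginal densities are plain sums over the remaining axes:

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, n3, m = 3, 4, 5, 6                        # |X_1|, |X_2|, |X_3|, |Y|
a = [rng.uniform(0.5, 2.0, size=(n, m)) for n in (n1, n2, n3)]

# product-form density sigma_pi(x1, x2, x3, y) = a_1(x1, y) a_2(x2, y) a_3(x3, y)
sigma = np.einsum('iy,jy,ky->ijky', a[0], a[1], a[2])

# two-marginals pi_i on X_i x Y and the marginal pi_{N+1} on Y
pi1 = sigma.sum(axis=(1, 2))                      # density of pi_1 on X_1 x Y
pi2 = sigma.sum(axis=(0, 2))
pi3 = sigma.sum(axis=(0, 1))
piY = sigma.sum(axis=(0, 1, 2))                   # density of pi_{N+1} on Y

# Lemma 5.1 with N = 3: sigma = pi1 * pi2 * pi3 / piY^(N-1)
recon = np.einsum('iy,jy,ky->ijky', pi1, pi2, pi3) / piY ** 2
```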

Now, we can draw the aforementioned connection between the barycenter problem (13) and \({{\,\mathrm{UMOT}\,}}_\varepsilon ^{(c_t,t\varphi ,\varphi _{N+1})}\), where

$$\begin{aligned} \varphi _{N+1}&~{:}{=}~ \iota _{[0,\infty )},\nonumber \\ D_{\varphi _{N+1}}(\cdot , \xi )&= 0, \quad \xi \in {\mathcal {M}}^+({\mathbb {Y}}), \end{aligned}$$
(15)

see Example 2.1 ii). Due to the special form of this entropy, the \((N+1)\)th input of \({{\,\mathrm{UMOT}\,}}_\varepsilon ^{(c_t,t\varphi ,\varphi _{N+1})}\) has no effect on the functional. To emphasize this, we use the notation

$$\begin{aligned}&{{\,\mathrm{UMOT}\,}}_\varepsilon ^{(c_t,t\varphi ,\varphi _{N+1})} (\mu ,*)\nonumber \\&\quad ~{:}{=}~ \min _{ \pi \in {\mathcal {M}}^+({\mathbb {X}}\times {\mathbb {Y}}) } \int _{{\mathbb {X}}\times {\mathbb {Y}}} c_t \,\mathrm {d}\pi + \varepsilon E(\pi ) \nonumber \\&\qquad +\sum _{i=1}^N D_{t_i \varphi _i}(P_{{\mathbb {X}}_{i\#}}\pi , \mu _i). \end{aligned}$$
(16)

The next theorem establishes the relation between the barycenter problem and this multi-marginal optimal transport problem.

Theorem 5.2

For \(i = 1,\ldots ,N\), let \(\mu _i \in {\mathcal {M}}^+({\mathbb {X}}_i)\) and cost functions \(c_i\in C({\mathbb {X}}_i \times {\mathbb {Y}})\) be given. Define \(c_t \in C({\mathbb {X}}\times {\mathbb {Y}})\) by (14), \(\psi \) by (12) and \(\varphi _{N+1}\) by (15). If \(\hat{\pi }\) is optimal for (16) then \(P_{{\mathbb {Y}}\#} \hat{\pi }\) minimizes the functional \(F(\xi )\) given by

$$\begin{aligned} \sum _{i=1}^N t_i {{\,\mathrm{UOT}\,}}_{\varepsilon /t_i}^{(c_i,\varphi _i,\psi )}(\mu _i,\xi ) - \varepsilon (N-1) E(\xi ).\quad \end{aligned}$$
(17)

Proof

Again, let \(\lambda \) denote the Lebesgue or counting measure, respectively. Set

$$\begin{aligned} C = \varepsilon \lambda ({\mathbb {Y}}) \biggl (\sum _{i=1}^N \lambda ({\mathbb {X}}_i) - (N-1) - \prod _{i=1}^N \lambda ({\mathbb {X}}_i)\biggr ). \end{aligned}$$

Estimating in both directions, we show that

$$\begin{aligned}&\inf _{\xi \in {\mathcal {M}}^+({\mathbb {Y}})} F(\xi ) = {{\,\mathrm{UMOT}\,}}_\varepsilon ^{(c_t,t\varphi ,\varphi _{N+1})}(\mu ,*) + C. \end{aligned}$$

1. Fix \(\xi \in {\mathcal {M}}^+({\mathbb {Y}})\) such that optimal plans \({\hat{\pi }}^{(i)} \in {\mathcal {M}}^+({\mathbb {X}}_i \times {\mathbb {Y}})\) for \({{\,\mathrm{UOT}\,}}_{\varepsilon /t_i}^{(c_i,\varphi _i,\psi )}(\mu _i,\xi )\), \(i = 1,\dotsc ,N\), exist. Then, we define

$$\begin{aligned} \pi ~{:}{=}~ \frac{\sigma _{{\hat{\pi }}^{(1)}} \cdots \sigma _{{\hat{\pi }}^{(N)}}}{{\sigma _\xi }^{N-1}} \, \lambda ^\otimes \in {\mathcal {M}}^+({\mathbb {X}}\times {\mathbb {Y}}), \end{aligned}$$

which yields \(P_{{\mathbb {X}}_i \times {\mathbb {Y}}\#} \pi = {\hat{\pi }}^{(i)}\) and \(P_{{\mathbb {Y}}\#} \pi = \xi \). Consequently, we get

$$\begin{aligned} F(\xi )&=\sum _{i = 1}^N \biggl (\int _{{\mathbb {X}}_i \times {\mathbb {Y}}} t_i c_i \,\mathrm {d}{\hat{\pi }}^{(i)} + \varepsilon E\bigl ({\hat{\pi }}^{(i)}\bigr )\biggr ) \\&\quad +\, \sum _{i = 1}^N t_i D_{\varphi _i}\bigl (P_{{\mathbb {X}}_{i\#}} {\hat{\pi }}^{(i)} , \mu _i\bigr ) - \varepsilon (N-1) E(\xi )\\&= \int _{{\mathbb {X}}\times {\mathbb {Y}}} \!\!c_t \,\mathrm {d}\pi + \varepsilon \Bigl ( \sum _{i = 1}^N E\bigl ({\hat{\pi }}^{(i)}\bigr ) \!-\! (N-1) E(\xi ) \Bigr ) \\&\quad +\, \sum _{i = 1}^N t_i D_{\varphi _i}\bigl (P_{{\mathbb {X}}_{i\#}} {\hat{\pi }}^{(i)} , \mu _i\bigr ). \end{aligned}$$

Using the definition of \(\pi \), we obtain

$$\begin{aligned}&\sum _{i = 1}^N E\bigl ({\hat{\pi }}^{(i)}\bigr ) - (N-1) E(\xi )\\&\quad = \sum _{i = 1}^N \biggl (\int _{{\mathbb {X}}_i \times {\mathbb {Y}}} \ln \bigl (\sigma _{{\hat{\pi }}^{(i)}}\bigr ) \,\mathrm {d}{\hat{\pi }}^{(i)} + \lambda ({\mathbb {X}}_i \times {\mathbb {Y}})\biggr )\\&\qquad - (N-1) \biggl (\int _{{\mathbb {Y}}} \ln (\sigma _\xi ) \,\mathrm {d}\xi + \lambda ({\mathbb {Y}})\biggr )\\&\qquad \underbrace{-\sum _{i=1}^N \pi ^{(i)}({\mathbb {X}}_i \times {\mathbb {Y}}) + (N-1)\xi ({\mathbb {Y}})}_{= -\pi ({\mathbb {X}}\times {\mathbb {Y}})}\\&\quad = \int _{{\mathbb {X}}\times {\mathbb {Y}}} \ln (\sigma _\pi ) \,\mathrm {d}\pi - \pi ({\mathbb {X}}\times {\mathbb {Y}})\\&\qquad + \sum _{i=1}^N \lambda ({\mathbb {X}}_i \times {\mathbb {Y}}) - (N-1)\lambda ({\mathbb {Y}})\\&\quad = E(\pi ) + C/\varepsilon . \end{aligned}$$

Incorporating \(t_i D_{\varphi _i} = D_{t_i \varphi _i}\), we obtain

$$\begin{aligned} F(\xi ) =&\int _{{\mathbb {X}}\times {\mathbb {Y}}} c_t \,\mathrm {d}\pi + \varepsilon E(\pi )+C \\&\quad + \sum _{i=1}^N D_{t_i \varphi _i}(P_{{\mathbb {X}}_{i\#}} \pi , \mu _i) . \end{aligned}$$

Thus, minimizing the right-hand side over all \(\pi \in {\mathcal {M}}^+({\mathbb {X}}\times {\mathbb {Y}})\), we get

$$\begin{aligned} F(\xi ) \ge {{\,\mathrm{UMOT}\,}}_\varepsilon ^{(c_t,t\varphi ,\varphi _{N+1})}(\mu ,*) + C, \end{aligned}$$

such that minimizing the left-hand side over all \(\xi \in {\mathcal {M}}^+({\mathbb {Y}})\) yields the desired estimate.

2. Next, we show the converse estimate. Let \({\hat{\pi }} \in {\mathcal {M}}^+({\mathbb {X}}\times {\mathbb {Y}})\) be the optimal plan and \(({{\hat{f}}}, {{\hat{g}}}) \in L^{\infty ,\times }({\mathbb {X}}) \times L^\infty ({\mathbb {Y}})\) be optimal dual potentials for \({{\,\mathrm{UMOT}\,}}_\varepsilon ^{ (c_t,t\varphi ,\varphi _{N+1})}\left( \mu ,*\right) \). For \(i = 1,\dotsc ,N\), define \(a_i \in L^\infty ({\mathbb {X}}_i \times {\mathbb {Y}})\) by

$$\begin{aligned} a_1&~{:}{=}~ \exp \Bigl (\frac{{{\hat{f}}}_1 + {{\hat{g}}} - t_1 c_1}{\varepsilon }\Bigr ) \quad \text { and }\\ a_i&~{:}{=}~ \exp \Bigl ( \frac{{{\hat{f}}}_i - t_i c_i}{\varepsilon }\Bigr ), \quad i = 2,\dotsc ,N. \end{aligned}$$

The definition of \(c_t\) together with (5) yields

$$\begin{aligned} \sigma _{{\hat{\pi }}} = \exp \Bigl ( \frac{{{\hat{f}}}^\oplus + {{\hat{g}}} - c_t}{\varepsilon } \Bigr ) = \prod _{i=1}^N a_i. \end{aligned}$$

Then, Lemma 5.1 implies

$$\begin{aligned} {\hat{\pi }} = (\sigma _{{\hat{\pi }}_{1}} \ldots \sigma _{{\hat{\pi }}_{N}}/{\sigma _\xi }^{N-1})\lambda ^\otimes , \end{aligned}$$

where \(\hat{\pi }_i ~{:}{=}~ {P_{{\mathbb {X}}_i \times {\mathbb {Y}}}}_\# \hat{\pi }\), \(i = 1,\ldots ,N\), and \(\xi ~{:}{=}~ P_{{\mathbb {Y}}\#}{\hat{\pi }}\). Similarly as for the previous considerations, this results in

$$\begin{aligned}&{{\,\mathrm{UMOT}\,}}_\varepsilon ^{(c_t,t\varphi ,\varphi _{N+1})}(\mu ,*) + C\\&\quad =\sum _{i = 1}^N \biggl ( \int _{{\mathbb {X}}_i \times {\mathbb {Y}}} t_i c_i \,\mathrm {d}{\hat{\pi }}_{i} +\varepsilon E({\hat{\pi }}_{i})\biggr )\\&\qquad + \sum _{i = 1}^N t_i D_{\varphi _i}\bigl (P_{{\mathbb {X}}_{i\#}}{\hat{\pi }}_{i} , \mu _i\bigr ) - \varepsilon (N-1) E(\xi )\\&\quad \ge F(\xi ). \end{aligned}$$

Minimizing the right-hand side over all \(\xi \in {\mathcal {M}}^+({\mathbb {Y}})\) yields

$$\begin{aligned} \! {{\,\mathrm{UMOT}\,}}_\varepsilon ^{(c_t,t\varphi ,\varphi _{N+1})}(\mu ,*) + C \ge \!\!\inf _{\xi \in {\mathcal {M}}^+({\mathbb {Y}})} \! F(\xi ) \end{aligned}$$

and thus the desired equality. As a direct consequence, we get that \(P_{{\mathbb {Y}}\#} {\hat{\pi }}\) minimizes (17). This concludes the proof. \(\square \)
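To illustrate Theorem 5.2, the barycenter can be read off as the \({\mathbb {Y}}\)-marginal of the optimal multi-marginal plan. The sketch below makes several simplifying assumptions of our own: a common 1-D grid for \({\mathbb {X}}_1, {\mathbb {X}}_2, {\mathbb {Y}}\), quadratic ground costs, KL penalties \(\varphi _i = \rho \,\mathrm {KL}\) on the \({\mathbb {X}}\)-marginals (whose aprox step reduces to the exponent \(\rho /(\rho +\varepsilon )\)), and no scaling update on \({\mathbb {Y}}\), since \(\varphi _{N+1} = \iota _{[0,\infty )}\) leaves that marginal unconstrained.

```python
import numpy as np

x = np.linspace(0, 1, 30)                            # common grid for X_1, X_2, Y
mu1 = np.exp(-((x - 0.25) / 0.08) ** 2); mu1 /= mu1.sum()
mu2 = np.exp(-((x - 0.75) / 0.08) ** 2); mu2 /= mu2.sum()
t1 = t2 = 0.5
eps, rho = 2e-2, 1e2

c = (x[:, None] - x[None, :]) ** 2                   # quadratic ground cost
C = t1 * c[:, None, :] + t2 * c[None, :, :]          # c_t(x1, x2, y), cf. (14)
K = np.exp(-C / eps)                                 # Gibbs kernel on X_1 x X_2 x Y
a1, a2 = np.ones_like(x), np.ones_like(x)            # no scaling on Y: phi_{N+1} is inactive
for _ in range(2000):
    S1 = np.einsum('ijy,j->i', K, a2)                # contract out X_2 and Y
    a1 = (mu1 / S1) ** (rho / (rho + eps))           # aprox step for rho * KL
    S2 = np.einsum('ijy,i->j', K, a1)
    a2 = (mu2 / S2) ** (rho / (rho + eps))
plan = np.einsum('i,j,ijy->ijy', a1, a2, K)
barycenter = plan.sum(axis=(0, 1))                   # xi = P_{Y#} pi-hat (Theorem 5.2)
```

For the two symmetric input bumps at 0.25 and 0.75 with equal weights, the resulting \(\xi \) concentrates around 0.5, and its total mass stays close to one since \(\rho \gg \varepsilon \) keeps the problem nearly balanced.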

Remark 5.3

The previous result directly generalizes to \({{\,\mathrm{UMOT}\,}}\) problems that also regularize the marginal on \({\mathbb {Y}}\). More precisely, let \(\mu _{N+1} \in {\mathcal {M}}^+({\mathbb {Y}})\) and \(\varphi _{N+1} \in \Gamma _0({\mathbb {R}})\) be an arbitrary entropy function. If \(\hat{\pi }\) is the optimal plan for \(\mathrm {UMOT}_\varepsilon ^{(c_t,t\varphi ,\varphi _{N+1})}\left( \mu ,\mu _{N+1}\right) \), then \(P_{{\mathbb {Y}}\#} \hat{\pi }\) solves

$$\begin{aligned}&\min _{\xi \in {\mathcal {M}}^+({\mathbb {Y}})} \sum _{i=1}^N t_i {{\,\mathrm{UOT}\,}}_{\varepsilon /t_i}^{(c_i,\varphi _i,\psi )}(\mu _i,\xi ) \nonumber \\&\quad - \varepsilon (N-1) E(\xi ) + D_{\varphi _{N+1}}(\xi ,\mu _{N+1}). \end{aligned}$$

Remark 5.4

(Comparison of formulations (13) and (16)) The proof of Theorem 5.2 reveals that the barycenter \({\hat{\xi }}\) in (13) is “over-regularized,” since it appears as the marginal measure of \(\pi ^{(i)}\) in each of the N entropy terms \(E(\pi ^{(i)})\). On the other hand, the proposed multi-marginal approach does not involve these \(N-1\) superfluous regularizers. This ensures that the minimizer \({\hat{\xi }}\) is less “blurred” compared to the original \({{\,\mathrm{UOT}\,}}_\varepsilon \) barycenter, which is favorable for most applications. A numerical illustration of this behavior is given in Sect. 6. We will see in Sect. 5.3 that for tree-structured costs the computation of optimal transport plans for the multi-marginal case has the same complexity per iteration as for the “barycentric” problems.

Furthermore, the computation of barycenters with an additional penalty term as outlined in Rem. 5.3 is possible with the Sinkhorn-type algorithm detailed in Sect. 5.3. In contrast, we are unaware of an efficient algorithm to solve the corresponding pairwise coupled formulation.

On the other hand, we only obtain the equivalence of the “pairwise coupled” formulation (13) and the multi-marginal approach for the choices made in (12) and (15). These enforce that all marginals of the plans in (13) coincide with the barycenter, which is not necessary in general. The pairwise coupled formulation is thus more flexible than \({{\,\mathrm{UMOT}\,}}\), although this flexibility comes at the cost of a nested optimization problem.

Remark 5.5

(Barycenters and MOT) By Theorem 5.2 and formula (17), for \(\varepsilon = 0\) and \(\varphi _i = \iota _{\{1\}}\), \(i=1,\ldots ,N\), our MOT formulation with the cost

$$\begin{aligned} c_t(x_1,\ldots ,x_{N+1}) = \sum _i t_i c_i(x_i,x_{N+1}) \end{aligned}$$

is equivalent to the OT barycenter problem. There is another reformulation of the OT barycenter problem via the McCann interpolation using the cost function

$$\begin{aligned} c(x_1,\ldots ,x_N) ~{:}{=}~ \min _y \sum _{i=1}^N t_i c_i(x_i,y) \end{aligned}$$

if a unique minimizer exists, see [10]. To the best of our knowledge, there is no similar reformulation for our setting.

5.2 General Tree-Structured Costs

In the above barycenter problem, we have considered \({{\,\mathrm{UMOT}\,}}_\varepsilon \) with a tree-structured cost function, where the tree was just star-shaped. In the rest of this section, we briefly discuss an extension of the \({{\,\mathrm{UMOT}\,}}_\varepsilon \) problem to costs of the form (11), where \({{\,\mathrm{{\mathcal {G}}}\,}}=({{\,\mathrm{{\mathcal {V}}}\,}}, {{\,\mathrm{{\mathcal {E}}}\,}})\) is a general tree graph with cost functions \(c_{(j,k)} \in C({\mathbb {X}}_j\times {\mathbb {X}}_k)\) for all \((j, k)\in {{\,\mathrm{{\mathcal {E}}}\,}}\). For the balanced case, this topic was addressed in [32, Prop. 4.2].

For a disjoint decomposition

$$\begin{aligned} {\mathcal {V}} = V \cup U, \quad V \cap U = \emptyset , \end{aligned}$$

where V contains only leaves, and measures \(\mu _v \in {\mathcal {M}}^+({\mathbb {X}}_v)\), \(v \in V\), we want to find measures \(\mu _u \in {\mathcal {M}}^+({\mathbb {X}}_u)\), \(u \in U\), that solve the problem

$$\begin{aligned}&\inf _{(\mu _u)_{u \in U}} \sum _{(j,k) \in {{\,\mathrm{{\mathcal {E}}}\,}}} t_{(j,k)} {{\,\mathrm{UOT}\,}}_{\varepsilon /t_{(j,k)}}^{(c_{(j,k)}, \varphi _{j},\varphi _{k})}(\mu _{j},\mu _{k}). \end{aligned}$$
(18)

Again, we assume that the unknown marginals \(\mu _u\) are exactly matched, i.e., \(\varphi _u = \iota _{\{1\}}\), \(u \in U\).

Example 5.6

For the barycenter problem (13), we have \(V = \{1,\ldots ,N\}\) and the tree is star-shaped, meaning that

$$\begin{aligned} {\mathcal {E}} = \{(j,N+1): j=1,\ldots ,N\}, \end{aligned}$$

see Fig. 1 left. In Fig. 1 middle, we have an H-shaped tree with \(N=7\), edge set \({\mathcal {E}} = \{(1,2),(2,3),(2,4),(4,6), (5,6),(6,7)\}\), and we consider problem (18) with \(V = \{1,3,5,7\}\). Finally, Fig. 1 right shows a line-shaped tree with \(N=4\), edge set \({\mathcal {E}} = \{(1,2),(2,3),(3,4)\}\) and \(V = \{1,4\}\). This graph is related to a so-called multiple barycenter problem, and its solution was discussed for the balanced case, e.g., in [9].
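The edge sets above can be encoded directly as lists of pairs. The following small Python snippet (the helper name `degrees` is ours, purely for illustration) verifies that the respective sets V indeed contain only leaves, as required for problem (18):

```python
def degrees(edges):
    """Node degrees of a graph given as a list of edge pairs."""
    deg = {}
    for j, k in edges:
        deg[j] = deg.get(j, 0) + 1
        deg[k] = deg.get(k, 0) + 1
    return deg

# H-shaped tree with N = 7 and line-shaped tree with N = 4 from Example 5.6.
h_edges = [(1, 2), (2, 3), (2, 4), (4, 6), (5, 6), (6, 7)]
line_edges = [(1, 2), (2, 3), (3, 4)]

assert all(degrees(h_edges)[v] == 1 for v in [1, 3, 5, 7])   # V = {1,3,5,7}
assert all(degrees(line_edges)[v] == 1 for v in [1, 4])      # V = {1,4}
```

Note that both graphs have one edge fewer than nodes, confirming that they are trees.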

In general, it is unclear how to solve problem (18) using Sinkhorn iterations. Therefore, we propose to solve a related multi-marginal problem

$$\begin{aligned} {{\,\mathrm{UMOT}\,}}_\varepsilon ^{\left( c_t, (t_v \varphi _v)_{v \in V}, (\varphi _u)_{u \in U} \!\right) } ((\mu _v)_{v \in V}, \!(\mu _u)_{u \in U}) \end{aligned}$$
(19)

where again \(\varphi _u ~{:}{=}~ \iota _{[0,\infty )}\) for all \(u \in U\) and \(t_v = t_{e}\) if the edge \(e \in {\mathcal {E}}\) joins v with some other node; this is well defined since every \(v \in V\) is a leaf and hence incident to exactly one edge. Then, we can prove in analogy to Lemma 5.1 that the optimal plan \({\hat{\pi }}\) is related to its marginals \({\hat{\pi }}_e\) and \({\hat{\mu }}_u ~{:}{=}~ P_{{\mathbb {X}}_u\#} {\hat{\pi }}\) by

$$\begin{aligned} \sigma _{{\hat{\pi }}} = \frac{\prod _{e \in {{\,\mathrm{{\mathcal {E}}}\,}}} \sigma _{{\hat{\pi }}_e}}{\prod _{u \in U} \sigma _{{\hat{\mu }}_u}^{\deg (u) - 1}}. \end{aligned}$$

Furthermore, we can show similarly as in the proof of Theorem 5.2 the following corollary.

Corollary 5.7

Under the above assumptions, if \({\hat{\pi }}\) is the optimal plan in (19), then the \({\hat{\mu }}_u= P_{{\mathbb {X}}_u\#} {\hat{\pi }}\), \(u\in U\), solve

$$\begin{aligned}&\!\!\!\inf _{\mu _u \in {\mathcal {M}}^+({\mathbb {X}}_u)} \! \sum _{(j,k) \in {{\,\mathrm{{\mathcal {E}}}\,}}} \!\!\!t_{(j,k)}\! {{\,\mathrm{UOT}\,}}_{\varepsilon /t_{(j,k)}}^{(c_{(j,k)}, \varphi _{j},\varphi _{k})}\!(\mu _{j},\mu _{k}) \\&\quad - \varepsilon \sum _{u \in U} (\deg (u) - 1) E(\mu _u). \end{aligned}$$

5.3 Efficient Sinkhorn Iterations for Tree-Structured Costs

Throughout this section, let \({\mathbb {X}}_1,\dotsc ,{\mathbb {X}}_N\) be finite subsets of \({\mathbb {R}}^d\) of size \(M_i ~{:}{=}~ |{\mathbb {X}}_i |> 0\), \(i = 1,\ldots ,N\). Furthermore, let \(\mu _i \in {\mathcal {M}}^+({\mathbb {X}}_i)\) be positive measures that are identified with vectors in \({\mathbb {R}}_+^{M_i}\). Hence, the reference measures \(\gamma _i\) are chosen as counting measures. Recall that the Sinkhorn mapping for a cost function \(c \in {\mathbb {R}}_{\ge 0}^{M_1 \times \dotsc \times M_N}\) is the composition of the \(\text {aprox}\) operator and the multi-marginal \((c,\varepsilon )\)-transform. As the former is applied pointwise, its computational cost is negligible. Hence, it suffices to discuss the efficient implementation of the multi-marginal \((c,\varepsilon )\)-transform. For vectors and matrices, we denote pointwise multiplication by \(\odot \) and pointwise division by \(\oslash \). Set

$$\begin{aligned} K ~{:}{=}~ \exp (-c / \varepsilon ) \in {\mathbb {R}}_+^{M_1 \times \dotsc \times M_N}. \end{aligned}$$

For efficiency reasons, we perform computations in the \(\exp \)-domain, i.e., instead of the Sinkhorn iterates \((f^{(n)})_{n \in {\mathbb {N}}}\) in Algorithm 1 we consider

$$\begin{aligned} u^{(n)}&~{:}{=}~ u_1^{(n)} \otimes \dotsc \otimes u_N^{(n)}\\ u_i^{(n)}&~{:}{=}~ \exp \Bigl (\frac{f^{(n)}_i}{\varepsilon }\Bigr ) , \quad i = 1,\dotsc ,N. \end{aligned}$$

Convergence of Algorithm 1 implies convergence of \(u^{(n)}\) to some \({{\hat{u}}} \in {\mathbb {R}}_+^{M_1 \times \dotsc \times M_N}\). By Proposition 3.4, the optimal plan \(\hat{\pi }\) is given by \(\hat{\pi } = K \odot {{\hat{u}}}\). For \(n \equiv j \mod N\) the Sinkhorn updates in the \(\exp \)-domain can be written as

$$\begin{aligned} u_j^{(n)} = \exp \biggl ( -\frac{1}{\varepsilon } \text {aprox}_{\varphi _j^*}^\varepsilon \biggl (-\varepsilon \ln \biggl ( \frac{u_j^{(n-1)} \odot \mu _j}{P_{{\mathbb {X}}_j\#}(K \odot u^{(n-1)})}\biggr )\biggr ) \biggr ), \end{aligned}$$

where the division has to be understood componentwise. Note that in this context \(P_{{\mathbb {X}}_j\#}(\cdot )\) corresponds to summing over all but the jth dimension. Although the involved expression \(\exp ( -\frac{1}{\varepsilon } \text {aprox}_{\varphi _j^*}^\varepsilon (-\varepsilon \ln (\cdot )))\) appears to be complicated, it simplifies for all the entropies from Example 4.5, see also [12].
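As an illustration, for a KL marginal penalty with weight \(\rho \), the composition above reduces to a componentwise power \(\rho /(\rho +\varepsilon )\), cf. [12]. The following minimal NumPy sketch of the two-marginal exp-domain iteration uses this simplification; the function name and parameters are ours, not taken from the paper's implementation:

```python
import numpy as np

def uot_sinkhorn_kl(mu1, mu2, C, eps, rho, n_iter=1000):
    """Two-marginal unbalanced Sinkhorn iteration in the exp-domain.

    For KL marginal penalties with weight rho, the update
    exp(-aprox(-eps*log(.))/eps) reduces to the componentwise
    power lam = rho / (rho + eps), so each step is a simple scaling.
    """
    K = np.exp(-C / eps)
    u1, u2 = np.ones_like(mu1), np.ones_like(mu2)
    lam = rho / (rho + eps)
    for _ in range(n_iter):
        u1 = (mu1 / (K @ u2)) ** lam     # update for the first marginal
        u2 = (mu2 / (K.T @ u1)) ** lam   # update for the second marginal
    return u1[:, None] * K * u2[None, :]  # plan pi = K * (u1 tensor u2)
```

For \(\rho \rightarrow \infty \) the exponent tends to 1 and one recovers the balanced Sinkhorn iteration; the N-marginal updates are analogous with \(P_{{\mathbb {X}}_j\#}(K \odot u)\) in place of the matrix-vector products.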

As recently discussed for the balanced case in [32], multi-marginal Sinkhorn iterations can be computed efficiently if the cost function decouples according to a tree, see also [2] for a wider class of cost functions. In this section, we generalize the approach for tree-structured costs to the unbalanced setting. As in the balanced case, computing the projections \(P_{{\mathbb {X}}_j\#}(K \odot u^{(n)})\), \(j = 1,\dotsc ,N\), is the computational bottleneck of the Sinkhorn algorithm. Fortunately, the Sinkhorn iterations reduce to standard matrix-vector multiplications in our particular setting.

(Algorithm 2: efficient Sinkhorn iterations for tree-structured costs; pseudocode figure not reproduced here)

Consider a tree \({{\,\mathrm{{\mathcal {G}}}\,}}= ({{\,\mathrm{{\mathcal {V}}}\,}},{{\,\mathrm{{\mathcal {E}}}\,}})\) as in Subsect. 5.2 and corresponding cost functions

$$\begin{aligned} c = \bigoplus _{(j,k) \in {{\,\mathrm{{\mathcal {E}}}\,}}} c^{(j,k)}, \quad c^{(j,k)} \in {\mathbb {R}}_{\ge 0}^{M_j \times M_k}. \end{aligned}$$

Then, it holds that \(K_{i_1, \dots , i_N} = \prod _{(j, k)\in {{\,\mathrm{{\mathcal {E}}}\,}}}K^{(j, k)}_{i_j, i_k}\), where

$$\begin{aligned} K^{(j,k)} ~{:}{=}~ \exp (-c^{(j,k)}/\varepsilon ) \in {\mathbb {R}}_+^{M_j \times M_k}. \end{aligned}$$
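For instance, for the line tree with edges (1, 2) and (2, 3), this entrywise factorization can be verified directly against the decoupled cost in NumPy (the matrices below are toy data, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy pairwise cost matrices on the line tree with edges (1, 2) and (2, 3).
c12, c23 = rng.random((4, 5)), rng.random((5, 3))
eps = 0.1
K12, K23 = np.exp(-c12 / eps), np.exp(-c23 / eps)

# Full kernel K_{i1,i2,i3} = K12_{i1,i2} * K23_{i2,i3}, built via broadcasting.
K = K12[:, :, None] * K23[None, :, :]

# The same object via the decoupled cost c(x1,x2,x3) = c12(x1,x2) + c23(x2,x3).
c = c12[:, :, None] + c23[None, :, :]
assert np.allclose(K, np.exp(-c / eps))
```

The point of the factorization is, of course, that the full tensor K never has to be formed in practice.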

The next result, cf. [32, Thm. 3.2], is the main ingredient for an efficient implementation of Algorithm 1 for solving \(\mathrm {UMOT}_{\varepsilon }\) with tree-structured cost functions.

Theorem 5.8

The projection onto the jth marginal of \(K \odot u\) is given by

$$\begin{aligned} {P_{{\mathbb {X}}_j\#}}(K \odot u) = u_j \odot \bigodot _{l \in {\mathcal {N}}_j} \alpha _{(j,l)}. \end{aligned}$$

Here, the \(\alpha _{(j,k)}\) are computed recursively for \((j,k) \in {\bar{{{\,\mathrm{{\mathcal {E}}}\,}}}} ~{:}{=}~ \{(v,w)| (v,w) \in {\mathcal {E}} \text { or } (w,v) \in {\mathcal {E}} \}\) starting from the leaves by

$$\begin{aligned} \alpha _{(j,k)} = K^{(j,k)} \biggl (u_k \odot \bigodot _{l \in {\mathcal {N}}_k \setminus \{j\}} \alpha _{(k,l)}\biggr ), \end{aligned}$$
(20)

with the usual convention that the empty product is 1.

First, we traverse the tree \({{\,\mathrm{{\mathcal {G}}}\,}}\) by a pre-order depth-first search with respect to a root \(v_0\). This results in a strict ordering of the nodes, which is encoded in the list \({{\,\mathrm{{\mathcal {V}}}\,}}_{\rightarrow }\). Every node \(k \in {{\,\mathrm{{\mathcal {V}}}\,}}\) except the root has a unique parent, denoted by p(k). We denote by \({{\,\mathrm{{\mathcal {V}}}\,}}_{\leftarrow }\) the reversed list \({{\,\mathrm{{\mathcal {V}}}\,}}_{\rightarrow }\). Note that the order in which we update the vectors \((\alpha _{(j,k)})_{(j,k) \in \bar{{{\,\mathrm{{\mathcal {E}}}\,}}}}\) and potentials \((u_j)_{j \in {\mathcal {V}}}\) in Algorithm 2 is consistent with the underlying recursion in (20). Furthermore, the computational complexity of Algorithm 2 is linear in N. More precisely, \(2(N-1)\) matrix-vector multiplications are performed to update every \(u_j\) once, which is in alignment with the two-marginal case. In particular, solving N decoupled problems has the same complexity per iteration, with the disadvantage that the marginals of the obtained transport plans do not necessarily fit to each other. Although we discussed Algorithm 2 mainly for computing barycenters, it can also be applied without free marginals, see Sect. 6.3 for a numerical example.
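The recursion (20) is a message-passing scheme on the tree. The following NumPy sketch (our own naming, not the paper's code) computes all projections \(P_{{\mathbb {X}}_j\#}(K \odot u)\) for a given tree using only matrix-vector products; on small examples it can be checked against a brute-force sum over the full tensor:

```python
import numpy as np

def tree_marginals(edges, K, u):
    """Compute P_j(K * u) for every node j of a tree via the recursion (20).

    edges : list of pairs (j, k) forming a tree
    K     : dict mapping (j, k) to the kernel matrix of shape (M_j, M_k)
    u     : dict mapping each node j to a positive vector of length M_j
    """
    nbrs = {}
    for j, k in edges:
        nbrs.setdefault(j, []).append(k)
        nbrs.setdefault(k, []).append(j)

    def ker(j, k):  # kernel oriented from node j to node k
        return K[(j, k)] if (j, k) in K else K[(k, j)].T

    alpha = {}  # messages alpha_{(j,k)}, computed leaves-first on demand

    def message(j, k):
        # alpha_{(j,k)} = K^{(j,k)} (u_k * prod of messages into k except from j)
        if (j, k) not in alpha:
            prod = u[k].copy()
            for l in nbrs[k]:
                if l != j:
                    prod = prod * message(k, l)
            alpha[(j, k)] = ker(j, k) @ prod
        return alpha[(j, k)]

    marg = {}
    for j in nbrs:
        out = u[j].copy()
        for l in nbrs[j]:
            out = out * message(j, l)   # P_j(K * u) = u_j * prod of messages
        marg[j] = out
    return marg
```

Each message is a single matrix-vector multiplication, so all N projections together cost \(2(N-1)\) such multiplications, matching the complexity count above.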

6 Numerical Examples

In this section, we present three numerical examples, where the first two confirm our theoretical findings from Sect. 5. Part of our Python implementation is built upon the POT toolbox [27].

From now on, all measures are of the form \(\sum _{k=1}^m \mu ^k \delta _{x_k}\) with support points \(x_k\in {\mathbb {R}}^d\), \(k=1,\ldots ,m\). We always use the cost function \(c(x, y) = \left\| x-y\right\| ^2\) with corresponding cost matrices \(c = \bigl (c(x_j,x_k)\bigr )_{j,k=1}^m\). All reference measures are counting measures, see Remark 3.2(iii), i.e., we exclusively deal with entropy regularization.

6.1 Barycenters of 1D Gaussians

We start by computing the barycenter of two simple measures \(\mu _1\), \(\mu _2\), which are produced by sampling the truncated normal distributions \({\mathcal {N}}(0.2,0.05)\) and \(2{\mathcal {N}}(0.8,0.08)\) on a uniform grid in [0, 1]. As the two measures have different total masses, an unbalanced approach is necessary. As before, we denote the discrete spaces of the input measures and of the barycenter by \({\mathbb {X}}_1 = {\mathbb {X}}_2 = {\mathbb {Y}}\). To approximate the marginals, we use the Shannon entropy functions \(\varphi = \varphi _1 = \varphi _2\) so that \(D_\varphi = {{\,\mathrm{KL}\,}}\). First, we solve the barycenter problem (13), which reads for \(t_1 = 1 - t\) and \(t_2 = t \in (0,1)\) as

$$\begin{aligned} {\hat{\xi }} =&\mathop {\mathrm {argmin}}\limits _{\xi } \sum _{i=1}^2 t_i \min _{ \pi ^{(i)} } \langle c, \pi ^{(i)} \rangle + \varepsilon E(\pi ^{(i)} )\\&+ t_i{{\,\mathrm{KL}\,}}\bigl ( P_{{\mathbb {X}}_{i\#}} \pi ^{(i)} , \mu _i \bigr ) \\&\mathrm {subject \; to} \quad {P_{{\mathbb {Y}}\#}} \pi ^{(1)} = {P_{{\mathbb {Y}}\#}} \pi ^{(2)} = \xi . \end{aligned}$$

The resulting barycenter for \(\varepsilon =0.005\) is computed using the Sinkhorn iterations described in [12, Sec. 5.2] and is shown in Fig. 2 for different \(t \in (0,1)\) together with the marginals \({\tilde{\mu }}_i ~{:}{=}~ P_{{\mathbb {X}}_{i\#}} {\hat{\pi }}^{(i)}\), \(i=1,2\).

We compare these barycenters with the marginal \(P_{{\mathbb {Y}}\#} {\hat{\pi }}\) of the corresponding optimal plan \({\hat{\pi }}\) for the multi-marginal problem

$$\begin{aligned}&\mathop {\mathrm {argmin}}\limits _{\pi \in {\mathcal {M}}^+({\mathbb {X}}\times {\mathbb {Y}})} \langle c_{t}, \pi \rangle + \varepsilon E(\pi )\\&\quad + \sum _{i=1}^2 t_i{{\,\mathrm{KL}\,}}\bigl ( P_{{\mathbb {X}}_{i\#}} \pi , \mu _i\bigr ) \end{aligned}$$

computed by Algorithm 2. The resulting marginals are provided in Fig. 2. As explained in Remark 5.4, the barycenters \({\hat{\xi }}\) appear smoothed compared to \(P_{{\mathbb {Y}}\#} {\hat{\pi }}\). As already mentioned, we do not have relations with shortest paths due to the missing metric structure.

Fig. 2

Comparison between the unbalanced barycenter \({\hat{\xi }}\) of two Gaussians (left) and the marginal \(P_{{\mathbb {Y}}\#} {\hat{\pi }}\) of the optimal transport plan for the corresponding multi-marginal problems (right) with different \(t \in (0,1)\). The input measures \(\mu _i\), \(i=1,2\) and the corresponding marginals \({\tilde{\mu }}_i\), \(i=1,2\), of the respective transport plans are shown as well

6.2 H-tree Shaped Cost Functions

Next, we turn our attention to the “interpolation” of four grayscale images of size \(100 \times 100\), considered as probability measures \(\mu _v\), \(v = 1,3,5,7\), along an H-shaped tree, see Fig. 1. The images are depicted in the four corners of Fig. 3a. In this example, the measures corresponding to the inner nodes \(u = 2,4,6\) have to be computed. For this purpose, we choose \(\varepsilon =4\cdot 10^{-4}\) and \(D_{\varphi _v}(\cdot , \mu _v)=0.05{{\,\mathrm{KL}\,}}(\cdot , \mu _v)\).

Comparison with \({{\,\mathrm{MOT}\,}}_\varepsilon \) As the measures have the same mass, we can compare our proposed \({{\,\mathrm{UMOT}\,}}_\varepsilon \) approach with the balanced \({{\,\mathrm{MOT}\,}}_\varepsilon \) method. Equal costs c as well as equal weights are assigned to the edges. Note that the total cost decouples according to the H-shaped tree. The obtained results for \({{\,\mathrm{MOT}\,}}_\varepsilon \) and \({{\,\mathrm{UMOT}\,}}_\varepsilon \) are depicted in Figs. 3a and b, respectively. For \({{\,\mathrm{UMOT}\,}}_\varepsilon \), the corners contain the marginals \(P_v {\hat{\pi }} = P_{{\mathbb {X}}_v\#} {\hat{\pi }}\) instead of the given measures. As the mass in the different image regions differs, mass is transported between them in the \({{\,\mathrm{MOT}\,}}_\varepsilon \) interpolation. In contrast, only a minimal amount of mass is transported between the image regions for \({{\,\mathrm{UMOT}\,}}_\varepsilon \), where the mass difference is compensated by only approximately matching the prescribed marginals. This behavior can be controlled by adjusting the weights in the \(\varphi \)-divergences.

While \({{\,\mathrm{MOT}\,}}\) becomes numerically unstable for \(\varepsilon \) smaller than \(4\cdot 10^{-4}\), the \({{\,\mathrm{UMOT}\,}}\) problem remains solvable for such small values of \(\varepsilon \). In our numerical experiments, which are not reported here, this led to less blurred images.

Fig. 3

Different “interpolations” between the four given images at the corners of Fig. 3a. Top: H-tree structured cost function. Bottom: Star-shaped decomposition approach. A common color coding is used for all images

Comparison with Multiple Star Graph Barycenters Next, we provide a heuristic comparison with an alternative interpolation approach. Instead of computing three interpolating measures simultaneously, we solve three individual barycenter problems based on star-shaped graphs with leaves corresponding to \(\mu _v\), \(v = 1,3,5,7\). This is done both with the \({{\,\mathrm{UOT}\,}}_\varepsilon \) and the corresponding \({{\,\mathrm{UMOT}\,}}_\varepsilon \) approach. More precisely, we solve (13) and (16) three times with weights \(1/12\cdot (5, 5, 1, 1)\), \(1/4\cdot (1, 1, 1, 1)\) and \(1/12\cdot (1, 1, 5, 5)\), respectively. These weights have been derived from the solution of the balanced H-graph problem with Dirac measures at the leaves, which is easy to obtain from a linear system corresponding to the first-order optimality conditions. The results are provided in Figs. 3c and d. Notably, both interpolations exhibit an even less pronounced mass transfer between the different image structures. However, the computed images look considerably smoother than their counterparts in Fig. 3b. Again, as expected, we observe that the \({{\,\mathrm{UOT}\,}}_\varepsilon \) results in Fig. 3d are more blurred than the corresponding \({{\,\mathrm{UMOT}\,}}_\varepsilon \) interpolations in Fig. 3c. As they are all marginals of multi-marginal transport plans, the images in Figs. 3a and b have the same mass. In contrast, the images in Figs. 3c and d do not necessarily have the same mass, as they are marginals of different transport plans. Hence, depending on the application, one or the other approach might be preferable.

Note that, in order to compute the multiplications with the matrices K in Algorithm 2, we exploit the fact that the Gaussian kernel K is separable in the two spatial dimensions, so that the multiplication can be performed over the rows and columns of the images one after another and we never actually store the matrix \(K\in {\mathbb {R}}^{10000\times 10000}\). Consequently, we cannot use the stabilizing absorption iterations proposed in [12].
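The separability trick can be illustrated with a short NumPy sketch (grid sizes below are tiny so that the full kernel can still be formed for comparison; the function name is ours): since the squared Euclidean cost on a grid splits over the coordinates, the kernel is the Kronecker product of two one-dimensional Gaussian kernels.

```python
import numpy as np

def apply_kernel_separable(V, Kx, Ky):
    """Apply the full kernel K = kron(Kx, Ky) to an image V without forming K:
    multiplying along the rows and columns separately gives K vec(V)."""
    return Kx @ V @ Ky.T

# Tiny grid so that the full kernel can still be built for comparison;
# for 100x100 images, K would be 10000 x 10000 and is never stored.
nx, ny, eps = 6, 5, 0.1
x, y = np.linspace(0, 1, nx), np.linspace(0, 1, ny)
Kx = np.exp(-(x[:, None] - x[None, :]) ** 2 / eps)
Ky = np.exp(-(y[:, None] - y[None, :]) ** 2 / eps)
K_full = np.kron(Kx, Ky)  # kernel on the flattened (row-major) grid

V = np.arange(nx * ny, dtype=float).reshape(nx, ny) / (nx * ny)
assert np.allclose((K_full @ V.ravel()).reshape(nx, ny),
                   apply_kernel_separable(V, Kx, Ky))
```

This reduces the cost of one kernel application from \({\mathcal {O}}(n_x^2 n_y^2)\) to \({\mathcal {O}}(n_x n_y (n_x + n_y))\) operations.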

6.3 Particle Tracking and Transfer Operators

Fig. 4

Top: Clean dataset. Middle upper: Added noise to get the measures \(\mu _i\), \(i=1,\ldots ,5\). Middle lower: Marginals \(P_{{\mathbb {X}}_{i\#}} {\hat{\pi }}\), \(i=1,\ldots ,5\), of optimal transport plan \({\hat{\pi }}\) for \({{\,\mathrm{UMOT}\,}}_\varepsilon (\mu )\). Bottom: Remaining noise and artifacts (marginals – clean)

Fig. 5

Zoom into marked patch in Fig. 4. Note that the brightness scale differs from Fig. 4 for better contrast. Left: Ground truth \(\mu _{ \mathrm {clean},5 }\). Middle left: Noisy \(\mu _{ 5 }\). Middle right: Transferred image \(K_{{{\,\mathrm{UMOT}\,}}}^\mathrm {T}\, \mu _{\mathrm {clean},1}\). Right: Transferred image \(K_{{{\,\mathrm{UOT}\,}}}^\mathrm {T}\, \mu _{\mathrm {clean},1}\)

Finally, we investigate whether computing a single joint \({{\,\mathrm{UMOT}\,}}_\varepsilon \) solution can be beneficial compared to computing several \({{\,\mathrm{UOT}\,}}_\varepsilon \) plans sequentially, e.g., for particle tracking. To this end, we create a time series of five images. Each image has size \(100\times 100\) pixels and consists of “dots” obtained by sampling uniform noise, setting all values above a small threshold to zero, and applying a Gaussian filter with small variance. One time step corresponds to shifting the image downward by two pixels, filling in the small constant background value from the top; this results in the images \(\mu _{\mathrm {clean},i}\), \(i=1, \ldots ,5\). We modify this time series by adding dots randomly at every time step in a similar manner. Consequently, the data consist of a drift component and some random dots popping up and disappearing again. The resulting data \(\mu _i\), \(i=1, \ldots ,5\), are shown in Fig. 4. As we want to extract only the drift component, we apply \({{\,\mathrm{UMOT}\,}}_\varepsilon (\mu )\) for a line-tree-structured cost function with the same costs \(c_i\) along the path, regularization parameter \(\varepsilon = 10^{-4}\), and \(D_{\varphi _i}(\cdot ,\mu _i)=7\cdot 10^{-4} {{\,\mathrm{TV}\,}}(\cdot , \mu _i)\), \(i=1,\ldots ,5\). We expect that the hard thresholding performed by the corresponding \(\text {aprox}\)-operators for \({{\,\mathrm{TV}\,}}(\cdot , \mu _i)\), \(i=1,\ldots ,5\), is particularly well suited for removing the noise in our example, see also [8, 26]. The resulting marginals \(P_{{\mathbb {X}}_{i\#}} {\hat{\pi }}\) of the optimal transport plan are shown in Fig. 4. Indeed, the method manages to remove most of the noise dots.

Transfer Operators For our next comparison, we use OT-related transfer operators, which have recently been discussed in [36]. We abuse notation in this section by sometimes identifying empirical measures with their vectors or matrices of weights for simplicity. In a nutshell, assuming a discrete, two-marginal setting of probability measures with optimal transport matrix \({\hat{\pi }}\) and left marginal vector \(\mu _\ell = (\mu _\ell ^k)_{k=1}^m\), we can define a row-stochastic transition matrix K by setting

$$\begin{aligned} K ~{:}{=}~ \mathop {\mathrm {diag}}\limits \left( \mu _\ell ^{-1}\right) {\hat{\pi }}. \end{aligned}$$

This concept allows us to propagate measures other than \(\mu _\ell \) forward in time by multiplication with \(K^\mathrm {T}\). Note that there is a continuous analog in terms of Frobenius–Perron operators, Markov kernels, and the disintegration theorem; see [6, 7, 30, 35, 38] for details.
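In the discrete setting, this construction is a two-liner; the following NumPy sketch (function names are ours, purely illustrative) builds the row-stochastic matrix K from a transport plan and propagates a measure forward:

```python
import numpy as np

def transfer_operator(pi):
    """Row-stochastic transition matrix K = diag(mu_l^{-1}) pi, where
    mu_l = pi.sum(axis=1) is the left marginal of the transport plan pi."""
    return pi / pi.sum(axis=1, keepdims=True)

def propagate(K, mu):
    """Propagate a measure forward in time: mu -> K^T mu."""
    return K.T @ mu
```

By construction, propagating the left marginal itself recovers the right marginal of the plan; composing several such kernels, as done below for \(K_{{{\,\mathrm{UOT}\,}}}\), amounts to a plain matrix product.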

Now, we compute the marginal \({\hat{\pi }}_{1,5} ~{:}{=}~ P_{{\mathbb {X}}_1 \times {\mathbb {X}}_5\#} {\hat{\pi }}\) of the optimal \({{\,\mathrm{UMOT}\,}}_\varepsilon (\mu )\) transport plan \({\hat{\pi }}\). Using the marginal \({\tilde{\mu }}_1 ~{:}{=}~ P_{{\mathbb {X}}_{1\#}} {\hat{\pi }}_{1,5} = P_{{\mathbb {X}}_{1\#}} {\hat{\pi }}\), we get the transfer operator

$$\begin{aligned} K_{{{\,\mathrm{UMOT}\,}}} = \mathop {\mathrm {diag}}\limits \left( {\tilde{\mu }}_1^{-1}\right) {\hat{\pi }}_{1,5}. \end{aligned}$$

Then, we propagate the first clean image \(\mu _{\mathrm {clean},1}\) by this transfer operator, i.e., we compute \(K_{{{\,\mathrm{UMOT}\,}}}^\mathrm {T}\, \mu _{\mathrm {clean},1}\). The result is shown in Fig. 5.

For comparison, we also compute \(N-1\) successive \({{\,\mathrm{UOT}\,}}_\varepsilon (\mu _i,\mu _{i+1})\) plans \({\hat{\pi }}^{(i)}\), \(i=1, \cdots , 4\). Denoting the marginals by \({\tilde{\mu }}_i ~{:}{=}~ P_{{\mathbb {X}}_{i\#}} {\hat{\pi }}^{(i)}\), \(i=1, \cdots , 4\), we consider the transfer kernel

$$\begin{aligned} K_{{{\,\mathrm{UOT}\,}}} = \prod _{i=1}^{4} \mathop {\mathrm {diag}}\limits \left( {\tilde{\mu }}_i^{-1}\right) {\hat{\pi }}^{(i)}. \end{aligned}$$

Then, we transfer the clean first image by this operator, i.e., we compute \(K_{{{\,\mathrm{UOT}\,}}}^\mathrm {T}\, \mu _{\mathrm {clean},1}\). The obtained results are shown in Fig. 5. Note that, similarly to the previous subsection, the computations can be carried out using separable convolutions without ever storing the large matrix K.

As we wanted to extract the drift behavior using only the noisy images, the propagated images should be compared to \(\mu _{\mathrm {clean},5}\), i.e., the last image of the first row in Fig. 4, which is just a shifted version of the first image. In some sense, we can interpret this image as the propagation under the “true” transfer operator. There are considerably fewer artifacts visible in the \({{\,\mathrm{UMOT}\,}}_\varepsilon \) propagated image than in the \({{\,\mathrm{UOT}\,}}_\varepsilon \) version. This is particularly pronounced in the middle left part of the images. As an error measure, we also computed the squared Euclidean distances between the propagated images and the ground truth, which are 2.98 and 6.48 for the \({{\,\mathrm{UMOT}\,}}_\varepsilon \) and \({{\,\mathrm{UOT}\,}}_\varepsilon \) version, respectively.

From an intuitive point of view, the results are not surprising. If we are only provided with a single pair of images from the sequence, it is much harder to judge which points correspond to noise than for a whole sequence of images. Note that a single Sinkhorn iteration for the coupled \({{\,\mathrm{UMOT}\,}}_\varepsilon \) problem has the same computational complexity as for all of the decoupled \({{\,\mathrm{UOT}\,}}_\varepsilon \) problems together. Hence, the \({{\,\mathrm{UMOT}\,}}_\varepsilon \) approach appears to be better suited for this application, as it incorporates more information about the problem without significant additional computational cost.

7 Conclusions

In this paper, we introduced a regularized unbalanced multi-marginal optimal transport framework, abbreviated \({{\,\mathrm{UMOT}\,}}_\varepsilon \), bridging the gap between regularized unbalanced optimal transport and regularized multi-marginal optimal transport. We outlined how the Sinkhorn algorithm can be adapted to solve \({{\,\mathrm{UMOT}\,}}_\varepsilon \) efficiently for tree-structured costs. For this case, we have also shown how \({{\,\mathrm{UMOT}\,}}_\varepsilon \) provides alternative solutions to barycenter-like problems with desirable properties, such as improved sharpness. In the future, we plan to examine \({{\,\mathrm{UMOT}\,}}_\varepsilon \) in connection with particle cluster tracking methods, e.g., following the ideas in [36]. Furthermore, it would be interesting to examine a reformulation of the regularized unbalanced barycenter problem as a UMOT problem using a cost function similar to that in Remark 5.5. Additionally, we want to investigate the theoretical relation between the two interpolation approaches from Sect. 6.2. Finally, we are interested in \({{\,\mathrm{UMOT}\,}}_\varepsilon \) for measures having the same moments up to a certain order and for measures living on special manifolds such as, e.g., tori or spheres, see also [22].