1 Introduction

1.1 Motivation

(Computational) optimal transport Optimal transport is a fundamental optimization problem with applications in various branches of mathematics. Let \(\mu \) and \(\nu \) be probability measures over spaces X and Y and let \(\Pi (\mu ,\nu )\) be the set of transport plans, i.e. probability measures on \(X \times Y\) with \(\mu \) and \(\nu \) as first and second marginal. Further, let \(c : X \times Y \rightarrow \mathbb {R}\) be a cost function. The Kantorovich formulation of optimal transport is then given by

$$\begin{aligned} {\inf \left\{ \int _{X \times Y} c(x,y)\,\text {d}\pi (x,y) \big \vert \pi \in \Pi (\mu ,\nu ) \right\} .} \end{aligned}$$
(1.1)

We refer to the monographs [21, 26] for a thorough introduction and historical context. Due to its geometric intuition and robustness it is becoming particularly popular in image and data analysis. Therefore, one main challenge is the development of efficient numerical methods, which has seen immense progress in recent years, such as solvers for the Monge–Ampère equation [3], semi-discrete methods [13, 16], entropic regularization [9], and multi-scale methods [18, 24]. An introduction to computational optimal transport, an overview on available efficient algorithms, and applications can be found in [19].
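To fix ideas in the discrete setting that is most relevant for computations, the following minimal sketch solves a toy instance of (1.1) as a linear program; the grid sizes, the uniform marginals and the use of scipy.optimize.linprog are illustrative choices of ours, not taken from the references above.

```python
import numpy as np
from scipy.optimize import linprog

# Toy discrete instance of (1.1): X and Y are finite grids in [0,1],
# mu and nu uniform, cost c(x,y) = |x - y|^2. All sizes are illustrative.
n, m = 4, 5
x = np.linspace(0.0, 1.0, n)
y = np.linspace(0.0, 1.0, m)
mu = np.full(n, 1.0 / n)
nu = np.full(m, 1.0 / m)
C = (x[:, None] - y[None, :]) ** 2

# Transport plans pi are n x m matrices with row sums mu and column sums nu.
A_eq = np.zeros((n + m, n * m))
for i in range(n):
    A_eq[i, i * m:(i + 1) * m] = 1.0      # P_X pi = mu
for j in range(m):
    A_eq[n + j, j::m] = 1.0               # P_Y pi = nu
b_eq = np.concatenate([mu, nu])

res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
pi_opt = res.x.reshape(n, m)              # an optimal transport plan
print("optimal cost:", res.fun)
```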

Domain decomposition Benamou introduced a domain decomposition algorithm for Wasserstein-2 optimal transport on \(\mathbb {R}^d\) [2], based on Brenier’s polar factorization [4]. In the language of (1.1) the algorithm works as follows: Let X and Y be subsets of \(\mathbb {R}^d\) and \(c(x,y)=\Vert x-y\Vert ^2\) the squared Euclidean distance. Let now \((X_1,X_2,X_3)\) be a partition of X and write \(X_{\{1,2\}} :=X_1 \cup X_2\), \(X_{\{2,3\}} :=X_2 \cup X_3\) and let \(\pi ^{(0)} \in \Pi (\mu ,\nu )\) be some initial feasible transport plan. The next iterate \(\pi ^{(1)}\) is then obtained by optimizing \(\pi ^{(0)}\) on the set \(X_{\{1,2\}} \times Y\), while keeping it fixed on \(X_3 \times Y\). Next, \(\pi ^{(2)}\) is obtained by optimizing \(\pi ^{(1)}\) on \(X_{\{2,3\}} \times Y\), while keeping it fixed on \(X_1 \times Y\). These two steps are then repeated. This is illustrated (for entropic optimal transport) in Fig. 1.

In [2] it is shown that this algorithm converges to the globally optimal solution when \(\mu \) is Lebesgue absolutely continuous and \((X_1,X_2,X_3)\) satisfy a ‘convex overlap’ condition, which roughly states that ‘if a function is convex on \(X_{\{1,2\}}\) and \(X_{\{2,3\}}\), then it must be convex on X’. Further, it is shown that this method can easily be extended to more than three cells, thus leading to a parallelizable algorithm. Unfortunately, it is also demonstrated that the discretized method does not converge to a global minimizer (for \(d>1\)). When the discretization is refined, it is shown that the optimal solution is recovered in the limit. But this requires an increasing number of discretization points in the three sets \((X_1,X_2,X_3)\) and thus numerically obtaining a good approximation of the globally optimal solution remains challenging with this method.

1.2 Outline and contribution

Entropic domain decomposition The key observation in this article is that for entropic optimal transport the domain decomposition algorithm converges to the unique globally optimal solution for arbitrary (measurable) bounded costs c, as soon as \(\mu (X_2)>0\). In particular, this covers the case where X and Y are discrete and finite. Therefore, entropic smoothing is not only a useful tool in solving the optimal transport sub-problems arising in domain decomposition, it also facilitates convergence of the whole algorithm. While the convergence rate can be very slow in suitable worst-case examples, we show empirically that it is fast on problems with sufficient geometric structure, thus leading to a practical and efficient parallel numerical method for large-scale problems.

Preliminaries and definition of the algorithm We start by establishing the basic mathematical setting and notation in Sect. 2.1. Some basic facts about entropic optimal transport are recalled in Sect. 2.2. In Sect. 3 the entropic domain decomposition algorithm is defined and a simple convergence proof is sketched.

Fig. 1

Iterates of the domain decomposition algorithm for entropic optimal transport in one dimension (yellow represents high, blue low mass density). \(\mu =\nu \) is the uniform probability measure on \(X=Y=[0,1]\), equipped with cost function \(c(x,y)=\Vert x-y\Vert ^2\). The optimal coupling is a blurred ‘diagonal’ measure (approximately equal to \(\pi ^{(5)}\)). We initialize with an ‘anti-diagonal’ feasible \(\pi ^{(0)}\), i.e. the order must be approximately reversed. X is divided into three cells (dashed lines, first panel) and \(\pi ^{(\ell )}\) is alternatingly optimized over \(X_{\{1,2\}}\) and \(X_{\{2,3\}}\). Iterates \(\pi ^{(1)}\) to \(\pi ^{(5)}\) were obtained by optimizing the previous iterate over the red rectangle (color figure online)

Theoretical and empirical worst-case convergence rate Section 4 is dedicated to proving that the iterates converge to the unique optimal solution linearly with respect to the Kullback–Leibler divergence. We first give a proof for the three-cell example discussed above (Sect. 4.1) and then generalize it to more general decompositions (Sect. 4.2). Our results apply to arbitrary (measurable) bounded cost functions c, marginal measures \(\mu \) and \(\nu \), and we make minimal assumptions on the decomposition structure (it must be ‘connected’ in a suitable sense, Definition 7, in particular convergence only requires \(\mu (X_2)>0\) in the three-cell example). As the entropic regularization parameter tends to zero, our rate bound tends to one exponentially. This is reminiscent of the convergence rate for the standard Sinkhorn algorithm established in [10] with respect to Hilbert’s projective metric. While we do not expect that our bounds on the convergence rate are tight, we demonstrate in Sect. 4.3 with some numerical (near) worst-case examples that they capture the qualitative behaviour of the algorithm on difficult problems.

Relation to parallel sorting The slow convergence rates from Sect. 4 do not suggest that the algorithm is efficient. But these slow rates rely on maliciously designed counter-examples. In practice usually much more geometric structure is available, similar to the setting originally considered by Benamou [2]. While a detailed analysis of the convergence rate in these cases is beyond the scope of this article, to provide some intuition, we discuss in Sect. 5 the relation of the domain decomposition algorithm for optimal transport with the odd-even transposition parallel sorting algorithm [14, Exercise 37] in one dimension. This algorithm is known to converge in O(N) iterations where N is the number of cells that the domain is partitioned into.

Practical implementation and geometric large-scale examples In Sect. 6 we provide a practical version of the domain decomposition algorithm, leading to an efficient numerical method for large scale problems with geometric structure. We discuss how the memory footprint can be reduced, how one can handle approximate solutions to the sub-problems as obtained by the Sinkhorn algorithm, and how to combine the algorithm with the \(\varepsilon \)-scaling heuristic and a multi-scale scheme, see [23].

Numerical experiments and comparison with single Sinkhorn algorithm The efficiency of the implementation is then demonstrated numerically by computing the Wasserstein-2 distance between images in Sect. 7. We report accurate approximate primal and dual solutions with low entropic regularization artifacts after a logarithmic number (w.r.t. the marginal size) of domain decomposition iterations. On a standard desktop computer with 6 cores a high-quality approximate optimal transport between two mega-pixel images is computed in approximately 4 minutes.

A comparison to a single Sinkhorn algorithm, when applied to the full problem, is given in Sect. 7.4. We find that the domain decomposition approach has several key advantages. First, it allows more extensive parallelization. The standard Sinkhorn algorithm is already parallel in the sense that each half-iteration consists essentially of a matrix-vector product that can be parallelized. But the results of this operation must be communicated to all workers after each half-iteration, thus only allowing local parallelization, for instance on a single GPU. In the domain decomposition variant, small sub-problems are solved independently and their results must only be communicated upon completion, thus allowing parallelization over multiple machines.

Second, domain decomposition reduces the memory footprint. The naive Sinkhorn algorithm requires storage of the full kernel matrix, the size of which grows quadratically with the marginal size. This can be avoided by using efficient heuristics, such as Gaussian convolutions or pre-factored heat kernels [25], but these methods only work in particular settings and with sufficiently high regularization. Alternatively, with adaptive kernel truncation [23] the memory demand is approximately linear in the marginal size. But to retain numerical stability, the truncation parameter may not be chosen too aggressively, thus still making memory demand a practical constraint. In the domain decomposition variant, only the kernel matrices for the sub-problems that are currently being re-solved are needed; for all other sub-problems only information about their marginals is kept. In addition, via parallelization this memory may also be distributed over several computers.

In our numerical experiments we find that in comparable runtime (even without parallelization) the domain decomposition algorithm obtains more accurate primal and dual iterates while requiring only approximately 11% of the memory of the single Sinkhorn solver. Dealing only with small sub-problems at a time also makes it easier to handle the numerical delicacies associated with small entropic regularization parameters. The domain decomposition method therefore solves large problems more reliably.

2 Background

2.1 Setting and notation

  • X and Y are compact metric spaces. We assume compactness to avoid overly technical arguments while covering the numerically relevant setting.

  • For a compact metric space Z the set of finite signed Radon measures over Z is denoted by \(\mathcal {M}(Z)\). The subsets of non-negative and probability measures are denoted by \(\mathcal {M}_+(Z)\) and \(\mathcal {M}_1(Z)\). The Radon norm of \(\mu \in \mathcal {M}(Z)\) is denoted by \(\Vert \mu \Vert _{\mathcal {M}(Z)}\) and one has \(\Vert \mu \Vert _{\mathcal {M}(Z)}=\mu (Z)\) for \(\mu \in \mathcal {M}_+(Z)\). We often simply write \(\Vert \mu \Vert \) for the norm when the meaning is clear from context.

  • For \(\mu \in \mathcal {M}_+(Z)\) we denote by \(L^1(Z,\mu )\), \(L^\infty (Z,\mu )\) the usual function spaces and add a subscript \(+\) to denote the subsets of \(\mu \)-a.e. non-negative functions. The corresponding norms are denoted by \(\Vert \cdot \Vert _{L^1(Z,\mu )}\) and \(\Vert \cdot \Vert _{L^\infty (Z,\mu )}\) but we often merely write \(\Vert \cdot \Vert _1\) and \(\Vert \cdot \Vert _\infty \) (or even \(\Vert \cdot \Vert \) for the latter) when space and measure are clear from context.

  • For \(\mu \in \mathcal {M}_+(Z)\) and a measurable function \(u : Z \rightarrow \mathbb {R}\) we denote by \(\left\langle \mu , u \right\rangle \) the integration of u with respect to \(\mu \), and define the normalized integral

    $$\begin{aligned} \fint _Z u \,\text {d}\mu :=\frac{1}{\mu (Z)} \int _Z u\,\text {d}\mu = \frac{\left\langle \mu , u \right\rangle }{\mu (Z)}. \end{aligned}$$
    (2.1)
  • For \(\mu \in \mathcal {M}_+(Z)\) and measurable \(S \subset Z\) the restriction of \(\mu \) to S is denoted by \(\mu {{\llcorner }}S\). Similarly, for a measurable function \(u : Z \rightarrow \mathbb {R}\), set \(u {{\llcorner }}S(x)=u(x)\) if \(x \in S\), and 0 otherwise.

  • The maps \(\text {P}_X : \mathcal {M}_+(X \times Y) \rightarrow \mathcal {M}_+(X)\) and \(\text {P}_Y : \mathcal {M}_+(X \times Y) \rightarrow \mathcal {M}_+(Y)\) denote the projections of measures on \(X \times Y\) to their marginals, i.e.

    $$\begin{aligned} (\text {P}_X \pi )(S_X) :=\pi (S_X \times Y) \qquad \text {and} \qquad (\text {P}_Y \pi )(S_Y) :=\pi (X \times S_Y) \end{aligned}$$

    for \(\pi \in \mathcal {M}_+(X \times Y)\), \(S_X \subset X\), \(S_Y \subset Y\) measurable.

  • For (measurable) functions \(a : X \rightarrow \mathbb {R}\cup \{\infty \}\), \(b : Y \rightarrow \mathbb {R}\cup \{\infty \}\), the functions \(a \oplus b, a \otimes b : X \times Y \rightarrow \mathbb {R}\cup \{\infty \}\) are given by

    $$\begin{aligned} (a \oplus b)(x,y)&= a(x) + b(y),&(a \otimes b)(x,y)&= a(x) \cdot b(y). \end{aligned}$$

    For two measures \(\mu \in \mathcal {M}_+(X)\), \(\nu \in \mathcal {M}_+(Y)\) their product measure on \(X \times Y\) is denoted by \(\mu \otimes \nu \).
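For later reference, the following minimal sketch spells out the above notation in the discrete case, under the assumed convention (ours, for illustration only) that measures on finite spaces are stored as non-negative numpy vectors and measures on \(X \times Y\) as matrices.

```python
import numpy as np

def proj_X(pi):                    # (P_X pi)(S_X) = pi(S_X x Y): row sums
    return pi.sum(axis=1)

def proj_Y(pi):                    # (P_Y pi)(S_Y) = pi(X x S_Y): column sums
    return pi.sum(axis=0)

def oplus(a, b):                   # (a ⊕ b)(x, y) = a(x) + b(y)
    return a[:, None] + b[None, :]

def otimes(a, b):                  # (a ⊗ b)(x, y) = a(x) * b(y)
    return a[:, None] * b[None, :]

def normalized_integral(u, mu):    # normalized integral (2.1): <mu, u> / mu(Z)
    return np.dot(mu, u) / mu.sum()
```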

2.2 Entropic optimal transport

We recall in this Section the entropic regularization approach to optimal transport and collect the main existence and characterization results which will be needed in this article.

Definition 1

(Kullback–Leibler divergence) Let Z be a compact metric space. For \(\mu \in \mathcal {M}(Z)\), \(\nu \in \mathcal {M}_+(Z)\) the Kullback–Leibler divergence (or relative entropy) of \(\mu \) w.r.t. \(\nu \) is given by

$$\begin{aligned}&{{\,\mathrm{KL}\,}}(\mu |\nu ) :={\left\{ \begin{array}{ll} \int _Z \varphi \left( \tfrac{\text {d}\mu }{\text {d}\nu }\right) \,\text {d}\nu &{} \text {if } \mu \ll \nu ,\,\mu \ge 0, \\ + \infty &{} \text {else,} \end{array}\right. }\\&\quad \text {with} \quad \varphi (s) :={\left\{ \begin{array}{ll} s\,\log (s)-s\,+1 &{} \,\text {if } s>0, \\ 1 &{} \text {if } s=0, \\ + \infty &{} \text {else.} \end{array}\right. } \end{aligned}$$

The Fenchel–Legendre conjugate of \(\varphi \) is given by \(\varphi ^*(z) = \exp (z)-1\).
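In the discrete case the divergence reduces to a finite sum. The following sketch (an illustration of Definition 1 under the array conventions assumed above, not part of the original text) evaluates it with the stated conventions for \(\varphi \).

```python
import numpy as np

def kl_div(mu, nu):
    """KL(mu|nu) for non-negative vectors, following Definition 1:
    sum of phi(mu_i/nu_i) * nu_i with phi(0) = 1, and +inf if mu has
    negative entries or is not absolutely continuous w.r.t. nu."""
    mu, nu = np.asarray(mu, dtype=float), np.asarray(nu, dtype=float)
    if np.any(mu < 0) or np.any((nu == 0) & (mu > 0)):
        return np.inf
    terms = np.zeros_like(mu)
    pos = (mu > 0) & (nu > 0)
    terms[pos] = mu[pos] * np.log(mu[pos] / nu[pos]) - mu[pos] + nu[pos]
    zero = (mu == 0) & (nu > 0)
    terms[zero] = nu[zero]          # phi(0) * nu = nu
    return float(terms.sum())
```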

Definition 2

(Entropy regularized optimal transport) Let \(\mu \in \mathcal {M}_1(X)\), \(\nu \in \mathcal {M}_1(Y)\) and \({\hat{\mu }} \in \mathcal {M}_+(X)\) and \({\hat{\nu }} \in \mathcal {M}_+(Y)\) such that

$$\begin{aligned} \Vert {\hat{\mu }}\Vert =\Vert {\hat{\nu }}\Vert&>0,&{\hat{\mu }}&\ll \mu ,&{\hat{\nu }}&\ll \nu ,&\tfrac{\text {d}{\hat{\mu }}}{\text {d}\mu }&\in L^\infty (X,\mu ),&\tfrac{\text {d}{\hat{\nu }}}{\text {d}\nu }&\in L^\infty (Y,\nu ). \end{aligned}$$

Similar to above, denote by

$$\begin{aligned} \Pi ({\hat{\mu }},{\hat{\nu }})&:=\left\{ \pi \in \mathcal {M}_+(X \times Y) \,\big \vert \, \text {P}_X \pi ={\hat{\mu }},\text {P}_Y \pi ={\hat{\nu }}\right\} \end{aligned}$$

the set of transport plans between \({\hat{\mu }}\) and \({\hat{\nu }}\). The condition \(\Vert {\hat{\mu }}\Vert =\Vert {\hat{\nu }}\Vert \) ensures that the set is non-empty. For a lower-semicontinuous cost function \(c \in L^\infty _+(X \times Y, \mu \otimes \nu )\) the Kantorovich optimal transport problem between \({\hat{\mu }}\) and \({\hat{\nu }}\) is then given by

$$\begin{aligned} \inf \left\{ \int _{X \times Y} c(x,y)\,\text {d}\pi (x,y) \big \vert \pi \in \Pi ({\hat{\mu }},{\hat{\nu }}) \right\} . \end{aligned}$$
(2.2)

Existence of minimizers in this setting is provided, for instance, by [26, Theorem 4.1]. For a regularization parameter \(\varepsilon >0\) define

$$\begin{aligned} k:=\exp (-c/\varepsilon ) \in L^\infty _+(X \times Y,\mu \otimes \nu ), \qquad \text {and} \qquad K:=k\cdot (\mu \otimes \nu ). \end{aligned}$$
(2.3)

In this article we will frequently exploit that k is bounded by \(0 < \exp (-\Vert c\Vert _{\infty }/\varepsilon ) \le k \le 1\). The entropic optimal transport problem, regularized with respect to \(\mu \otimes \nu \), is then given by

$$\begin{aligned} {\inf \left\{ \varepsilon {{\,\mathrm{KL}\,}}(\pi |K) \,\big \vert \, \pi \in \Pi ({\hat{\mu }},{\hat{\nu }})\right\} .} \end{aligned}$$
(2.4)

Problem (2.4) can be solved (approximately) with the Sinkhorn algorithm (Remark 1). In addition, in our article, entropic smoothing will ensure convergence of the domain decomposition scheme. We refer to [19, Chapter 4] for an overview on entropic optimal transport and its numerical implications. The (\(\Gamma \)-)convergence of (2.4) to (2.2) as \(\varepsilon \rightarrow 0\) has been shown in [6, 7, 15] in various settings and with different strategies.
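In the discrete setting, (2.3) and the objective of (2.4) take the following form; this sketch continues the array conventions and the kl_div helper of the previous sketches and is again only an illustration.

```python
import numpy as np

def make_kernel(C, mu, nu, eps):
    """Discrete version of (2.3): k = exp(-c/eps), K = k * (mu ⊗ nu)."""
    k = np.exp(-C / eps)            # entries in (exp(-||c||_inf/eps), 1]
    K = k * np.outer(mu, nu)
    return k, K

def entropic_cost(pi, K, eps):
    """Objective of (2.4): eps * KL(pi|K), using kl_div from the sketch above."""
    return eps * kl_div(pi.ravel(), K.ravel())
```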

We now collect some results about entropic optimal transport required in this article. A proof is given in the “Appendix”.

Proposition 1

(Optimal entropic transport couplings)

(i)

    (2.4) has a unique minimizer \(\pi ^*\in \Pi ({\hat{\mu }},{\hat{\nu }})\).

(ii)

    There exist measurable \(u^*: X \rightarrow \mathbb {R}_+\), \(v^*: Y \rightarrow \mathbb {R}_+\) such that \(\pi ^*= u^*\otimes v^*\cdot K\). \(u^*\) and \(v^*\) are unique \(\mu \)-a.e. and \(\nu \)-a.e. up to a positive re-scaling \((u^*,v^*) \rightarrow (\lambda \cdot u^*, \lambda ^{-1} \cdot v^*)\) for \(\lambda >0\).

(iii)

    The couple (\(u^*\), \(v^*\)) satisfies the integral equations

    $$\begin{aligned} u^*(x) \int _Y k(x,y')v^*(y')\,\text {d}\nu (y') = \tfrac{\text {d}{\hat{\mu }}}{\text {d}\mu }(x) \quad \text {and} \quad v^*(y) \int _X k(x',y)u^*(x')\,\text {d}\mu (x') = \tfrac{\text {d}{\hat{\nu }}}{\text {d}\nu }(y)\nonumber \\ \end{aligned}$$
    (2.5)

    for \(\mu \)-a.e. \(x \in X\) and \(\nu \)-a.e. \(y \in Y\).

(iv)

    \(u^*\in L^\infty _+(X,\mu )\), \(v^*\in L^\infty _+(Y,\nu )\), \(\Vert u^*\Vert _{L^1(X,\mu )}>0\), \(\Vert v^*\Vert _{L^1(Y,\nu )}>0\),

    $$\begin{aligned} \tfrac{1}{\Vert v^*\Vert _1} \tfrac{\text {d}{\hat{\mu }}}{\text {d}\mu }(x)&\le u^*(x) \le \tfrac{\exp (\Vert c\Vert _\infty /\varepsilon )}{\Vert v^*\Vert _1} \tfrac{\text {d}{\hat{\mu }}}{\text {d}\mu }(x)&\text {for } \mu \text {-a.e.}~x \in X, \end{aligned}$$
    (2.6)
    $$\begin{aligned} \tfrac{1}{\Vert u^*\Vert _1} \tfrac{\text {d}{\hat{\nu }}}{\text {d}\nu }(y)&\le v^*(y) \le \tfrac{\exp (\Vert c\Vert _\infty /\varepsilon )}{\Vert u^*\Vert _1} \tfrac{\text {d}{\hat{\nu }}}{\text {d}\nu }(y)&\text {for } \nu \text {-a.e.}~y \in Y \end{aligned}$$
    (2.7)

    and \(\log u^*\in L^1(X,{\hat{\mu }})\), \(\log v^*\in L^1(Y,{\hat{\nu }})\).

Definition 3

(Dual problem) In addition to (2.4), we will consider a corresponding dual problem which, for the purpose of this article, is best stated as

$$\begin{aligned} {\sup \left\{ J(u,v) \big \vert u \in L^\infty _+(X,\mu ),\,v \in L^\infty _+(Y,\nu ) \right\} } \end{aligned}$$
(2.8)

with \(J : L^\infty _+(X,\mu ) \times L^\infty _+(Y,\nu ) \rightarrow [-\infty ,\infty )\) given by

$$\begin{aligned} J: (u,v)&\mapsto \int _X \log u\,\text {d}{\hat{\mu }} + \int _Y \log v\,\text {d}{\hat{\nu }} - \int _{X \times Y} u \otimes v\,\text {d}K+ \Vert K\Vert . \end{aligned}$$
(2.9)
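For concreteness, a discrete evaluation of J under the array conventions of the earlier sketches might look as follows (our illustration; entries where u or v vanish yield \(-\infty \), consistent with J mapping into \([-\infty ,\infty )\)).

```python
import numpy as np

def dual_J(u, v, hat_mu, hat_nu, K):
    """Discrete version of (2.9):
    <hat_mu, log u> + <hat_nu, log v> - <K, u ⊗ v> + ||K||."""
    with np.errstate(divide="ignore"):          # log(0) = -inf is intended
        return (np.dot(hat_mu, np.log(u)) + np.dot(hat_nu, np.log(v))
                - u @ K @ v + K.sum())
```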

This is a reparametrization of the usual dual problem where one uses \(\varepsilon \log u\) and \(\varepsilon \log v\) as variables instead (cf. [23]). Again, we briefly collect a few helpful results. A proof is given in the “Appendix”.

Proposition 2

(Duality for regularized optimal transport)

(i)

    The functional J, (2.9), is well defined.

(ii)

For \(u \in L^\infty _+(X,\mu )\), \(v \in L^\infty _+(Y,\nu )\), \(\pi \in \Pi ({\hat{\mu }},{\hat{\nu }})\) one has \(J(u,v) \le {{\,\mathrm{KL}\,}}(\pi |K)\).

(iii)

    A transport plan \(\pi ^*\in \Pi ({\hat{\mu }},{\hat{\nu }})\) of the form \(\pi ^*= u^*\otimes v^*\cdot K\) for \(u^*\in L^\infty _+(X,\mu )\), \(v^*\in L^\infty _+(Y,\nu )\) is optimal for (2.4). In that case, \((u^*,v^*)\) are optimal for (2.8). Maximizers for (2.8) exist.

Remark 1

(Sinkhorn algorithm) Problems (2.4) and (2.8) can be solved (approximately) with the Sinkhorn algorithm. For some initial \(u^{(0)} \in L_+^\infty (X,\mu )\) (not identically zero), it is given for \(\ell =0,1,2,\ldots \) by

$$\begin{aligned} v^{(\ell )}(y)&:=\frac{\tfrac{\text {d}{\hat{\nu }}}{\text {d}\nu }(y)}{\int _X k(x,y)\,u^{(\ell )}(x)\,\text {d}\mu (x)},&u^{(\ell +1)}(x)&:=\frac{\tfrac{\text {d}{\hat{\mu }}}{\text {d}\mu }(x)}{\int _Y k(x,y)\,v^{(\ell )}(y)\,\text {d}\nu (y)}. \end{aligned}$$

We refer to these two steps as Y- and X-iteration. It is quickly verified that a Y-iteration corresponds to a partial optimization of J, (2.9), over v for fixed u. Conversely, an X-iteration is obtained by optimizing over u for fixed v. Hence the sequence \((u^{(\ell )},v^{(\ell )})_\ell \) is a maximizing sequence of (2.9). Complementarily, the sequence of measures \(\pi ^{(\ell )} :=(u^{(\ell )} \otimes v^{(\ell )}) \cdot K\) converges (under suitable assumptions) to the solution of (2.4). In general one has \(\text {P}_Y \pi ^{(\ell )} = {\hat{\nu }}\) but \(\text {P}_X \pi ^{(\ell )} \ne {\hat{\mu }}\). Conversely, for \(\pi ^{(\ell +1/2)} :=(u^{(\ell +1)} \otimes v^{(\ell )}) \cdot K\) one has \(\text {P}_X \pi ^{(\ell +1/2)} = {\hat{\mu }}\) but \(\text {P}_Y \pi ^{(\ell +1/2)} \ne {\hat{\nu }}\). Again, for a thorough overview we refer to [19, Chapter 4]. A computationally efficient implementation is discussed in [23], which we use as a reference in this article.
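A minimal discrete sketch of these iterations is given below (our own illustration, not the reference implementation of [23]); it works directly with the kernel matrix K from (2.3), which absorbs the reference measures, and uses a fixed iteration count for simplicity. For small \(\varepsilon \) this direct formulation becomes numerically delicate; more robust implementations are discussed in [23].

```python
import numpy as np

def sinkhorn(K, hat_mu, hat_nu, u0=None, n_iter=1000):
    """Alternating Y- and X-iterations for (2.4)/(2.8) in the discrete setting.
    K: kernel matrix from (2.3); hat_mu, hat_nu: prescribed marginals."""
    u = np.ones(K.shape[0]) if u0 is None else u0.copy()
    for _ in range(n_iter):
        v = hat_nu / (K.T @ u)            # Y-iteration: now P_Y pi = hat_nu
        u = hat_mu / (K @ v)              # X-iteration: now P_X pi = hat_mu
    pi = u[:, None] * K * v[None, :]      # pi = (u ⊗ v) · K
    return pi, u, v
```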

3 Entropic domain decomposition algorithm

Throughout this article we will be concerned with solving

$$\begin{aligned} {\min \left\{ \varepsilon \,{{\,\mathrm{KL}\,}}(\pi |K) \,\big \vert \,\pi \in \Pi (\mu ,\nu ) \right\} } \end{aligned}$$
(3.1)

for \(\mu \in \mathcal {M}_1(X)\), \(\nu \in \mathcal {M}_1(Y)\), \(c \in L^\infty _{+}(X \times Y,\mu \otimes \nu )\), \(\varepsilon >0\), \(k:=\exp (-c/\varepsilon )\), \(K:=k\cdot (\mu \otimes \nu )\). By Proposition 1, (3.1) has a unique minimizer \(\pi ^*\) that can be represented as \(\pi ^*= (u^*\otimes v^*) \cdot K\) for corresponding scaling factors \(u^*\), \(v^*\).

Following Benamou [2] (see Sect. 1.1), the strategy of the algorithm is as follows: Divide the space X into cells according to two partitions \(\mathcal {J}_A\) and \(\mathcal {J}_B\) (which induce two partitions of the product space \(X \times Y\)) that need to ‘overlap’ in a suitable sense. Then, starting from an initial feasible coupling \(\pi ^{(0)} \in \Pi (\mu ,\nu )\), optimize the coupling on each of the cells of \(\mathcal {J}_A\) separately, while keeping the marginals on that cell fixed. This can be done independently and in parallel for each cell, and the coupling attains a potentially better score while remaining feasible. Then repeat this step on partition \(\mathcal {J}_B\), then again on \(\mathcal {J}_A\) and so on, continuing to alternate between the two partitions.

To construct \(\mathcal {J}_A\) and \(\mathcal {J}_B\) we first define a basic partition of small cells. The two overlapping partitions are then created by suitable merging of the basic cells. This induces a graph over the basic partition cells as vertices. The algorithm converges when this graph is connected, see Sect. 4 for details.

Definition 4

(Basic and composite partitions) A partition of X into measurable sets \(\{X_i\}_{i \in I}\), for some finite index set I, is called a basic partition of \((X,\mu )\) if the measures \(\mu _i :=\mu {{\llcorner }}X_i\) for \(i \in I\) satisfy \(\Vert \mu _i\Vert >0\). By construction one has \(\sum _{i \in I} \mu _i = \mu \). In addition we will write \(K_i :=K{{\llcorner }}(X_i \times Y)\) for \(i \in I\). Often we will refer to a basic partition merely by the index set I.

For a basic partition \(\{X_i\}_{i \in I}\) of \((X,\mu )\) a composite partition \(\mathcal {J}\) is a partition of I. For \(J \in \mathcal {J}\) we will use the following notation:

$$\begin{aligned} X_J&:=\bigcup _{i \in J} X_i,&\mu _J&:=\sum _{i \in J} \mu _i=\mu {{\llcorner }}X_J,&K_J&:=\sum _{i \in J} K_i = K{{\llcorner }}(X_J \times Y). \end{aligned}$$

Of course, the family \(\{X_J\}_{J \in \mathcal {J}}\) is a measurable partition of X and the families \(\{X_i \times Y\}_{i \in I}\) and \(\{X_J \times Y\}_{J \in \mathcal {J}}\) are measurable partitions of \(X \times Y\).

Throughout the article we will use one basic partition I and two corresponding composite partitions \(\mathcal {J}_A\) and \(\mathcal {J}_B\). A formal statement of the algorithm is given in Algorithm 1.

Example 1

In the three-cell example from Sect. 1.1, the basic partition is given by \(\{X_1,X_2,X_3\}\), \(I=\{1,2,3\}\) and the two composite partitions are given by \(\mathcal {J}_A=\{\{1,2\},\{3\}\}\) and \(\mathcal {J}_B=\{\{1\},\{2,3\}\}\). For \(\ell \) odd, i.e. during a \(\mathcal {J}_A\)-iteration, the optimization on the ‘single cell’ \(\{3\}\) can be skipped since it is made redundant by the optimization on \(\{2,3\}\) in the subsequent \(\mathcal {J}_B\)-iteration: One has \(P_Y(\pi ^{(\ell )} {{\llcorner }}(X_{3} \times Y))=P_Y(\pi ^{(\ell -1)} {{\llcorner }}(X_{3} \times Y))\) and thus, when computing the partial marginal for cell \(\{2,3\}\) in the next iteration, one finds

$$\begin{aligned}\nu _{\{2,3\}}^{(\ell +1)}=P_Y(\pi ^{(\ell )} {{\llcorner }}(X_{\{2,3\}} \times Y))=P_Y(\pi ^{(\ell )} {{\llcorner }}(X_{2} \times Y))+P_Y(\pi ^{(\ell )} {{\llcorner }}(X_{3} \times Y)) \\ =P_Y(\pi ^{(\ell )} {{\llcorner }}(X_{2} \times Y))+P_Y(\pi ^{(\ell -1)} {{\llcorner }}(X_{3} \times Y)). \end{aligned}$$

The same holds for the cell \(\{1\}\) during the \(\mathcal {J}_B\)-iteration. We then obtain the method described in the introduction.

Algorithm 1 (entropic domain decomposition algorithm)
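To complement the formal statement, the following sketch (our own, under the discrete conventions of the earlier sketches and reusing make_kernel and sinkhorn from above) spells out one way to realize the iterations; the data layout and the fixed number of sweeps are illustrative choices.

```python
import numpy as np

def domain_decomposition(C, mu, nu, eps, basic_cells, J_A, J_B, pi0, n_sweeps=20):
    """Discrete sketch of Algorithm 1.
    basic_cells: list of index arrays partitioning {0,...,len(mu)-1};
    J_A, J_B: composite partitions, as lists of lists of basic-cell indices;
    pi0: feasible initial plan in Pi(mu, nu)."""
    _, K = make_kernel(C, mu, nu, eps)
    pi = pi0.copy()
    for ell in range(1, n_sweeps + 1):
        comp = J_A if ell % 2 == 1 else J_B           # J^(ell): A if ell odd, B else
        new_pi = np.zeros_like(pi)
        for J in comp:
            rows = np.concatenate([basic_cells[i] for i in J])  # indices of X_J
            nu_J = pi[rows, :].sum(axis=0)   # line 7: nu_J = P_Y(pi restricted to X_J x Y)
            # line 8: entropic transport between mu_J and nu_J, regularized w.r.t. K_J
            pi_J, _, _ = sinkhorn(K[rows, :], mu[rows], nu_J)
            new_pi[rows, :] = pi_J
        pi = new_pi
    return pi
```

For the three-cell setting of Example 1 one would take basic_cells to be the three index arrays describing \(X_1\), \(X_2\), \(X_3\) and, in zero-based indexing, J_A = [[0, 1], [2]] and J_B = [[0], [1, 2]].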

Proposition 3

Algorithm 1 is well-defined and for all \(\ell \in \mathbb {N}_0\) one has \(\pi ^{(\ell )} \in \Pi (\mu ,\nu )\).

Proof

\(\pi ^{(0)} \in \Pi (\mu ,\nu )\) by assumption. Assume now \(\pi ^{(\ell -1)} \in \Pi (\mu ,\nu )\) for some \(\ell \ge 1\). Since

$$\begin{aligned} \mu _J=\mu {{\llcorner }}X_J = (\text {P}_X \pi ^{(\ell -1)}) {{\llcorner }}X_J = \text {P}_X (\pi ^{(\ell -1)} {{\llcorner }}(X_J \times Y)) \end{aligned}$$

we find that \(\Vert \nu ^{(\ell )}_J\Vert =\Vert \mu _J\Vert >0\) (for \(\nu ^{(\ell )}_J\) as introduced in line 7). Also, since \(0 \le \mu _J \le \mu \) and \(0 \le \nu ^{(\ell )}_J\le \nu \), the values of \(\tfrac{\text {d}\nu ^{(\ell )}_J}{\text {d}\nu }\) and \(\tfrac{\text {d}\mu _J}{\text {d}\mu }\) are (up to negligible sets) contained in [0, 1]. Therefore, by Proposition 1, the regularized optimal transport problem in line 8 has a unique solution and thus \(\pi ^{(\ell )}_J\) is well-defined. We verify

$$\begin{aligned} \text {P}_X \pi ^{(\ell )}&=\sum _{J \in \mathcal {J}^{(\ell )}} \text {P}_X \pi ^{(\ell )}_J = \sum _{J \in \mathcal {J}^{(\ell )}} \mu _J = \mu , \\ \text {P}_Y \pi ^{(\ell )}&=\sum _{J \in \mathcal {J}^{(\ell )}} \text {P}_Y \pi ^{(\ell )}_J = \sum _{J \in \mathcal {J}^{(\ell )}} \nu ^{(\ell )}_J = \sum _{J \in \mathcal {J}^{(\ell )}} \text {P}_Y (\pi ^{(\ell -1)} {{\llcorner }}(X_J \times Y)) = \text {P}_Y \pi ^{(\ell -1)} = \nu . \end{aligned}$$

Therefore, \(\pi ^{(\ell )} \in \Pi (\mu ,\nu )\) and the claim follows by induction. \(\square \)

Definition 5

(Notation of iterates) For \(\ell \ge 1\) the following sequences will be used throughout the rest of this article:

  • Composite partitions \(\mathcal {J}^{(\ell )} :=\mathcal {J}_A\) if \(\ell \) is odd, \(\mathcal {J}_B\) else, as defined in Algorithm 1.

  • Primal iterates \(\pi ^{(\ell )}\), as defined in Algorithm 1.

  • Partial primal iterates and corresponding Y-marginals on the composite partition cells,

    $$\begin{aligned} \pi ^{(\ell )}_J&:=\pi ^{(\ell )} {{\llcorner }}(X_J \times Y),&\nu ^{(\ell )}_J&:=\text {P}_Y \pi ^{(\ell )}_J \end{aligned}$$

    for \(J \in \mathcal {J}^{(\ell )}\), consistent with their definition in Algorithm 1. Let \(u^{(\ell )}_J\), \(v^{(\ell )}_J\) be corresponding scaling factors such that \(\pi ^{(\ell )}_J = u^{(\ell )}_J \otimes v^{(\ell )}_J \cdot K\). Their existence is provided by Proposition 1.

  • Partial primal iterates and corresponding Y-marginals on the basic partition cells,

    $$\begin{aligned} \pi ^{(\ell )}_i&:=\pi ^{(\ell )} {{\llcorner }}(X_i \times Y),&\nu ^{(\ell )}_i&:=\text {P}_Y \pi ^{(\ell )}_i, \end{aligned}$$

    for \(i \in I\). It is easy to verify that for

    $$\begin{aligned} u^{(\ell )}_i(x)&:={\left\{ \begin{array}{ll} u^{(\ell )}_J(x) &{} \text {if } x \in X_i, \\ 0 &{} \text {else,} \end{array}\right. },&v^{(\ell )}_i(y)&:=v^{(\ell )}_J(y) \end{aligned}$$
    (3.2)

    where \(J \in \mathcal {J}^{(\ell )}\) is uniquely determined by the condition \(i \in J\), one finds \(\pi ^{(\ell )}_i = u^{(\ell )}_i \otimes v^{(\ell )}_i \cdot K\) and therefore, by Propositions 1 and 2 , \(\pi ^{(\ell )}_i\) is the unique optimal entropic coupling between \(\mu _i\) and \(\nu ^{(\ell )}_i\). This also follows from a restriction argument as in [26, Theorem 4.6].

Based on Proposition 1, the partial iterates locally satisfy corresponding bounds, which we collect in the following Lemma; its proof is postponed to the “Appendix”.

Lemma 1

(Boundedness of scaling factors and density bounds) Let \(\ell > 1\), \(i \in I\) and \(J \in \mathcal {J}^{(\ell )}\) such that \(i \in J\).

(i)

    There exist \({\bar{C}}, {C} > 0\), possibly depending on \(\ell \) and J, such that

    $$\begin{aligned} \left\{ \begin{array}{ll} &{} {C} \le u^{(\ell )}_J(x) \le {\bar{C}} \quad \text {for } \mu _J\text {-a.e.}~ x \in X_J \\ &{} {C} \cdot \tfrac{\text {d}\nu _J^{(\ell )}}{\text {d}\nu }(y) \le v^{(\ell )}_J(y) \le {\bar{C}} \quad \text {for } \nu \text {-a.e.}~ y \in Y \end{array} \right. \end{aligned}$$

    and, in particular, \(\log u^{(\ell )}_J \in L^1(X, \mu _J)\) and \(\log v^{(\ell )}_J \in L^1(Y, \nu ^{(\ell )}_J)\).

(ii)

    The ratios of the \(u^{(\ell )}_J\)-scaling factor are bounded,

    $$\begin{aligned} \frac{u_J^{(\ell )}(x_1)}{u_J^{(\ell )}(x_2)} \le \exp (\Vert c\Vert /\varepsilon ) \quad \hbox {for } \mu _J \hbox {-a.e.~} x_1, x_2 \in X_J. \end{aligned}$$
    (3.3)
(iii)

Appropriate versions of (i) and (ii) hold for \(u^{(\ell )}_i\), \(v^{(\ell )}_i\) via their definition, (3.2).

(iv)

    The product of the scaling factors \(u^{(\ell )}_i\), \(v^{(\ell )}_i\) is locally controlled by

    $$\begin{aligned} \tfrac{\text {d}\nu ^{(\ell )}_i}{\text {d}\nu }(y) \cdot \frac{\exp (-\Vert c\Vert /\varepsilon )}{\Vert \mu _i\Vert } \le u^{(\ell )}_i(x) \cdot v^{(\ell )}_i(y)&\le \frac{\exp (2\Vert c\Vert /\varepsilon )}{\Vert \mu _i\Vert } \end{aligned}$$
    (3.4)

    for \(\mu _i\)-a.e. \(x \in X_i\) and \(\nu \)-a.e. \(y \in Y\).

(v)

    For any \(j \in J\), one has

    $$\begin{aligned} u^{(\ell )}_{i}(x) \cdot v^{(\ell )}_{i}(y) \ge \exp (-3\Vert c\Vert /\varepsilon )\,\Vert \mu _{j}\Vert \cdot u^{(\ell -1)}_{j}(x') \cdot v^{(\ell -1)}_{j}(y) \end{aligned}$$
    (3.5)

    for \((\mu _i \otimes \mu _{j} \otimes \nu )\)-a.e. \((x,x',y) \in X_i \times X_j \times Y\), and

    $$\begin{aligned} \tfrac{\text {d}\nu ^{(\ell )}_i}{\text {d}\nu }(y) \ge \exp (-2\Vert c\Vert /\varepsilon )\,\Vert \mu _i\Vert \cdot \tfrac{\text {d}\nu ^{(\ell -1)}_{j}}{\text {d}\nu }(y) \quad \text {for } \nu \text {-a.e. } y \in Y. \end{aligned}$$
    (3.6)

To provide some intuition, we now sketch a simple proof that Algorithm 1 converges to the globally optimal solution. For simplicity, we restrict it to the three-cell problem and discrete (and finite) spaces X and Y. While it could be extended to the general setting, we instead refer to Sect. 4.

Proposition 4

Let X and Y be finite sets. Let \(\{X_1,X_2,X_3\}\), \(I=\{1,2,3\}\) and \(\mathcal {J}_A=\{\{1,2\},\{3\}\}\) and \(\mathcal {J}_B=\{\{1\},\{2,3\}\}\) be a basic and two composite partitions of X. Then Algorithm 1 converges to the globally optimal solution of problem (3.1).

Proof

(Sketch) When X and Y are discrete, the spaces of measures and measurable functions all become finite-dimensional real vector spaces. For simplicity, we may assume that \(\mu \) and \(\nu \) assign strictly positive mass to every point in X and Y. Then \(K\) assigns strictly positive mass to every point in \(X \times Y\) and therefore \(\pi \ll K\) for any \(\pi \in \mathcal {M}_+(X \times Y)\). The function \({{\,\mathrm{KL}\,}}(\cdot |K)\) is therefore finite and continuous on \(\mathcal {M}_+(X \times Y)\). Denote by S the ‘solving’ map \(\mathcal {M}_+(X \times Y) \ni \pi \mapsto \arg \min \{ \varepsilon \,{{\,\mathrm{KL}\,}}({\hat{\pi }}|K) | {\hat{\pi }} \in \Pi (\text {P}_X \pi ,\text {P}_Y \pi )\}\) which is well-defined.

We show that S is continuous: Let \((\pi _n)_n\) be a sequence of measures in \(\mathcal {M}_+(X \times Y)\) converging to \(\pi _\infty \) (this implies that the entries of \(\pi _n\) are uniformly bounded), then \(S(\pi _n)=u_n \otimes v_n \cdot K\) for suitable \(u_n\), \(v_n\) by Proposition 1. By re-scaling, with (2.6), (2.7) we may assume that \(u_n\) and \(v_n\) are uniformly bounded and by compactness in finite dimensions, we may extract some cluster points \(u_\infty \) and \(v_\infty \). Since \(u_n \otimes v_n \cdot K\in \Pi (\text {P}_X \pi _n, \text {P}_Y \pi _n)\), one finds that \(u_\infty \otimes v_\infty \cdot K\in \Pi (\text {P}_X \pi _\infty ,\text {P}_Y \pi _\infty )\), which must be equal to \(S(\pi _\infty )\) by Proposition 2 (iii). Therefore, \(S(\pi _n)\) converges to \(S(\pi _\infty )\) and S is continuous.

Let now \(F_A\) be the map that takes an iterate \(\pi ^{(\ell -1)}\) to \(\pi ^{(\ell )}\) when \(\ell \) is odd, and \(F_B\) the map for \(\ell \) even. In the discrete setting, the restriction of a measure to a subset is a continuous map, and since \(F_A\) and \(F_B\) are built from measure restriction and the solving map S, they are continuous.

Consider now the sequence of odd-numbered iterates \((\pi ^{(2\ell +1)})_\ell \). In finite dimensions we may extract a cluster point \(\pi ^*\). By continuity of \(F_A\) and \(F_B\), \(F_A(F_B(\pi ^*))\) must also be a cluster point of \((\pi ^{(2\ell +1)})_\ell \). Since the sequence \(({{\,\mathrm{KL}\,}}(\pi ^{(\ell )}|K))_\ell \) is non-increasing (and \({{\,\mathrm{KL}\,}}(\cdot |K)\) is continuous here), we must have \({{\,\mathrm{KL}\,}}(\pi ^*|K)={{\,\mathrm{KL}\,}}(F_A(F_B(\pi ^*))|K)\). But \(F_A(F_B(\pi ^*)) \ne \pi ^*\) would imply that the \({{\,\mathrm{KL}\,}}\) divergence has been strictly decreased, so we must have that \(F_A(F_B(\pi ^*))=F_B(\pi ^*)=\pi ^*\), i.e. \(\pi ^*\) is partially optimal on all the cells of the composite partitions \(\mathcal {J}_A\) and \(\mathcal {J}_B\). So there are scaling factors \((u_{\{1,2\}},v_{\{1,2\}})\), \((u_{\{3\}},v_{\{3\}})\) and \((u_{\{1\}},v_{\{1\}})\), \((u_{\{2,3\}},v_{\{2,3\}})\) such that we can write

$$\begin{aligned} \pi ^*= u_{\{1,2\}} \otimes v_{\{1,2\}} \cdot K_{\{1,2\}} + u_{\{3\}} \otimes v_{\{3\}} \cdot K_{\{3\}} = u_{\{1\}} \otimes v_{\{1\}} \cdot K_{\{1\}} + u_{\{2,3\}} \otimes v_{\{2,3\}} \cdot K_{\{2,3\}}. \end{aligned}$$

We find that on \(X_2 \times Y\) the functions \(u_{\{1,2\}} \otimes v_{\{1,2\}}\) and \(u_{\{2,3\}} \otimes v_{\{2,3\}}\) must coincide \(K\)-a.e., and by re-scaling we may thus assume that \(u_{\{1,2\}}=u_{\{2,3\}}\) on \(X_2\) \(\mu \)-a.e., and \(v_{\{1,2\}}=v_{\{2,3\}}\) on Y \(\nu \)-a.e. and therefore we can ‘glue’ together \(u_{\{1,2\}}\) and \(u_{\{2,3\}}\) to get some u and set \(v=v_{\{1,2\}}=v_{\{2,3\}}\) such that \(\pi ^*=u \otimes v \cdot K\), which must therefore be globally optimal by Proposition 2 (iii). Since this holds for any cluster point \(\pi ^*\), \((\pi ^{(\ell )})_\ell \) converges to this point. \(\square \)

4 Linear convergence

In this Section we prove linear convergence of Algorithm 1 to the globally optimal solution of (3.1) with respect to the Kullback–Leibler divergence. We start with the proof for the three-cell setup in Sect. 4.1 and then give the proof for general decompositions in Sect. 4.2. Numerical worst-case examples that demonstrate the qualitative accuracy of our convergence rate bounds are presented in Sect. 4.3.

Definition 6

(Primal suboptimality) For \(\pi \in \Pi (\mu ,\nu )\) the (scaled) suboptimality of \(\pi \) for problem (3.1) is denoted by

$$\begin{aligned} \varDelta (\pi )&:={{\,\mathrm{KL}\,}}(\pi |K)-{{\,\mathrm{KL}\,}}(\pi ^*|K) \end{aligned}$$
(4.1)

where we recall that \(\pi ^*\) denotes the unique minimizer.

Remark 2

(Proof strategy) The basic strategy for the proofs is inspired by [17] where linear convergence for block coordinate descent methods is shown. Indeed, the domain decomposition algorithm can be interpreted as a block coordinate descent for the entropic optimal transport problem, where in each iteration we add some additional artificial constraints that fix the Y-marginals of the current candidate coupling on each of the composite cells. The fundamental strategy of [17] is to find a bound \(\varDelta (\pi ^{(\ell )}) \le C \left( \varDelta (\pi ^{(\ell -k)})-\varDelta (\pi ^{(\ell )})\right) \) for some \(C \in \mathbb {R}_+\) where \(k=1\) in the three-cell setting and \(k>1\) in the general setting. Rearranging yields \(\varDelta (\pi ^{(\ell )}) \le \tfrac{C}{1+C}\,\varDelta (\pi ^{(\ell -k)})\), which then quickly implies linear convergence of \(\varDelta (\pi ^{(\ell )})\) to 0. In this article, we exploit the identity \(\varDelta (\pi ^{(\ell -k)})-\varDelta (\pi ^{(\ell )})={{\,\mathrm{KL}\,}}(\pi ^{(\ell -k)}|\pi ^{(\ell )})\) (Lemma 3) and we use a primal-dual gap estimate (Proposition 2 (ii)) to bound \(\varDelta (\pi ^{(\ell )}) \le {{\,\mathrm{KL}\,}}(\pi ^{(\ell )}|{\hat{\pi }})\) for a suitable \({\hat{\pi }}\) of the form \({\hat{u}} \otimes {\hat{v}} \cdot K\) (Lemma 4) and then use carefully chosen \({\hat{u}}\) and \({\hat{v}}\) to control the latter by the former.

A major challenge is that Algorithm 1 does not produce any ‘full’ dual iterates. We may only use the dual variables for the composite cell problems in line 8, but they are not necessarily consistent across multiple cells. Inspired by the proof of Proposition 4 we propose a way to approximately ‘glue’ them together to obtain a global candidate.

4.1 Three cells

We first state the main result of this section.

Theorem 1

(Linear convergence for three-cell decomposition) Let \((X_1,X_2,X_3)\), \(I=\{1,2,3\}\), be a basic partition of \((X,\mu )\) into three cells and set \(\mathcal {J}_A=\{\{1,2\},\{3\}\}\), \(\mathcal {J}_B=\{\{1\},\{2,3\}\}\). Then, for iterates of Algorithm 1 one has for \(\ell > 1\),

$$\begin{aligned} \varDelta (\pi ^{(\ell )}) \le \left( 1+\exp \left( -\tfrac{2\Vert c\Vert }{\varepsilon }\right) \tfrac{\Vert \mu _2\Vert }{\Vert \mu _{\{1,3\}}\Vert }\right) ^{-1} \cdot \varDelta (\pi ^{(\ell -1)}). \end{aligned}$$
(4.2)

In particular, the domain decomposition algorithm converges to the optimal solution.

The proof of Theorem 1 requires some auxiliary Lemmas. Most of the Lemmas are already formulated for more general partitions to allow reusing them in the next Section. The Lemmas strongly rely on algebraic properties of the \({{\,\mathrm{KL}\,}}\) divergence that are used in a similar way throughout the literature, see for example [1, Lemma 2].

Lemma 2

Let \(\pi =(u \otimes v) \cdot K\) for \(u \in L^\infty _+(X,\mu )\), \(v \in L^\infty _+(Y,\nu )\). Then

$$\begin{aligned} {{\,\mathrm{KL}\,}}(\pi |K)&= \left\langle \text {P}_X \pi , \log u \right\rangle + \left\langle \text {P}_Y \pi , \log v \right\rangle - \Vert \pi \Vert + \Vert K\Vert \,. \end{aligned}$$
(4.3)

Similarly, given \(\pi _i = (u_i \otimes v_i) \cdot K_i\) for \(u_i \in L^\infty _+(X,\mu )\), \(v_i \in L^\infty _+(Y,\nu )\), \(i \in I\), and \(\pi = \sum _{i\in I} \pi _i\), then

$$\begin{aligned} {{\,\mathrm{KL}\,}}(\pi |K)&= \sum _{i\in I} \left[ \left\langle \text {P}_X \pi _i, \log u_i \right\rangle + \left\langle \text {P}_Y \pi _i, \log v_i \right\rangle \right] - \Vert \pi \Vert + \Vert K\Vert . \end{aligned}$$
(4.4)

We briefly recall from Sect. 2.1 at this point that \(\left\langle \rho , f \right\rangle \) denotes integration of the measurable function f against the measure \(\rho \).

Proof

We show the second statement. The first then follows by setting \((X_{i})_{i \in I},I\) to the trivial partition \((X_1:=X)\), \(I:=\{1\}\). Arguing as in Proposition 1 (iv) we find \(\log u_i \in L^1(X,\text {P}_X \pi _i)\), \(\log v_i \in L^1(Y,\text {P}_Y \pi _i)\) and therefore, all integrals in the following are finite. The proof now follows quickly from direct computation:

$$\begin{aligned} {{\,\mathrm{KL}\,}}(\pi |K)&= \sum _{i \in I} {{\,\mathrm{KL}\,}}(\pi _i|K_i) = \sum _{i \in I} \int _{X \times Y} \varphi \left( \tfrac{\text {d}\pi _i}{\text {d}K_i}\right) \text {d}K_i \\&= \sum _{i \in I} \left[ \int _{X \times Y} \log \left( u_i(x)\,v_i(y)\right) \,\text {d}\pi _i(x,y) - \pi _i(X \times Y)+K_i(X \times Y) \right] \\&= \sum _{i \in I} \left[ \int _{X \times Y} \big [ \log (u_i(x))+ \log (v_i(y)) \big ] \text {d}\pi _i(x,y) \right] - \Vert \pi \Vert +\Vert K\Vert \\&= \sum _{i\in I} \left[ \left\langle \text {P}_X \pi _i, \log u_i \right\rangle + \left\langle \text {P}_Y \pi _i, \log v_i \right\rangle \right] - \Vert \pi \Vert + \Vert K\Vert \end{aligned}$$

\(\square \)

Lemma 3

(Expressions for decrement and sub-optimality) Let \(\ell > 1\). Then,

$$\begin{aligned} \varDelta (\pi ^{(\ell -1)})-\varDelta (\pi ^{(\ell )}) = {{\,\mathrm{KL}\,}}(\pi ^{(\ell -1)}|K)-{{\,\mathrm{KL}\,}}(\pi ^{(\ell )}|K) = {{\,\mathrm{KL}\,}}(\pi ^{(\ell -1)}|\pi ^{(\ell )}) \end{aligned}$$

and

$$\begin{aligned} \varDelta (\pi ^{(\ell )}) = {{\,\mathrm{KL}\,}}(\pi ^{(\ell )}|K)-{{\,\mathrm{KL}\,}}(\pi ^*|K) = {{\,\mathrm{KL}\,}}(\pi ^{(\ell )}|\pi ^*). \end{aligned}$$

Proof

Fix \(\ell > 1\). With the notation of Definition 5 we can decompose the current and previous iterate as

$$\begin{aligned} \pi ^{(\ell -1)}&= \sum _{i \in I} u^{(\ell -1)}_i \otimes v^{(\ell -1)}_i \cdot K_i,&\pi ^{(\ell )}&= \sum _{i \in I} u^{(\ell )}_i \otimes v^{(\ell )}_i \cdot K_i, \end{aligned}$$

with all partial scalings being essentially bounded with respect to \(\mu \) or \(\nu \). So, by using Lemma 2, we compute

$$\begin{aligned}&{{\,\mathrm{KL}\,}}(\pi ^{(\ell -1)}|K) - {{\,\mathrm{KL}\,}}(\pi ^{(\ell )}|K) = \sum _{J \in \mathcal {J}^{(\ell )}} \sum _{i \in J} \Big [{{\,\mathrm{KL}\,}}(\pi _i^{(\ell -1)}|K_i) - {{\,\mathrm{KL}\,}}(\pi _i^{(\ell )}|K_i) \Big ] \nonumber \\&= \sum _{J \in \mathcal {J}^{(\ell )}} \sum _{i \in J} \Big [ \left\langle \text {P}_X \pi ^{(\ell -1)}_i, \log u^{(\ell -1)}_i \right\rangle +\left\langle \text {P}_Y \pi ^{(\ell -1)}_i, \log v^{(\ell -1)}_i \right\rangle \nonumber \\&\qquad \qquad \qquad - \left\langle \text {P}_X \pi ^{(\ell )}_i, \log u^{(\ell )}_i \right\rangle - \left\langle \text {P}_Y \pi ^{(\ell )}_i, \log v^{(\ell )}_i \right\rangle -\Vert \pi ^{(\ell -1)}_i\Vert + \Vert \pi ^{(\ell )}_i\Vert \Big ] \nonumber \\&= \sum _{J \in \mathcal {J}^{(\ell )}} \sum _{i \in J} \Big [ \left\langle \mu _i, \log u^{(\ell -1)}_i \right\rangle +\left\langle \nu ^{(\ell -1)}_i, \log v^{(\ell -1)}_i \right\rangle \nonumber \\&\qquad \qquad \qquad - \left\langle \mu _i, \log u^{(\ell )}_i \right\rangle - \left\langle \nu ^{(\ell )}_i, \log v^{(\ell )}_i \right\rangle -\Vert \pi ^{(\ell -1)}_i\Vert + \Vert \pi ^{(\ell )}_i\Vert \Big ]. \end{aligned}$$
(4.5)

Note now that, for any \(J \in \mathcal {J}^{(\ell )}\), the Y-marginal on each composite cell is kept fixed during the iteration (see Algorithm 1, lines 7-8), i.e.,

$$\begin{aligned} \sum _{i \in J} \nu ^{(\ell )}_i = \text {P}_Y \pi ^{(\ell )}_J = \nu ^{(\ell )}_J = \text {P}_Y \pi ^{(\ell -1)}_J = \sum _{i \in J} \nu ^{(\ell -1)}_i \end{aligned}$$

Hence, since \(v^{(\ell )}_{i} = v^{(\ell )}_{J} = v^{(\ell )}_{j}\) for all \(i, j \in J\) (see (3.2)), we have

$$\begin{aligned} \sum _{i \in J} \left\langle \nu ^{(\ell )}_i, \log v^{(\ell )}_i \right\rangle = \sum _{i \in J} \left\langle \nu ^{(\ell -1)}_i, \log v^{(\ell )}_i \right\rangle . \end{aligned}$$

Thus, we can continue:

$$\begin{aligned} {(4.5)}&= \sum _{J \in \mathcal {J}^{(\ell )}} \sum _{i \in J} \left[ \left\langle {\mu }_i, \log \left( \tfrac{u^{(\ell -1)}_i}{u^{(\ell )}_i}\right) \right\rangle +\left\langle \nu ^{(\ell -1)}_i, \log \left( \tfrac{v^{(\ell -1)}_i}{v^{(\ell )}_i}\right) \right\rangle -\Vert \pi ^{(\ell -1)}_i\Vert + \Vert \pi ^{(\ell )}_i\Vert \right] \\&= \sum _{i \in I} {{\,\mathrm{KL}\,}}(\pi ^{(\ell -1)}_i|\pi ^{(\ell )}_i) = {{\,\mathrm{KL}\,}}(\pi ^{(\ell -1)}|\pi ^{(\ell )}). \end{aligned}$$

The proof of the second statement follows the same steps. It hinges on the fact that \(\text {P}_X \pi ^{(\ell -1)}_i=\mu _i=\text {P}_X\pi ^*_i\), all \(v^*_{i}=v^*\) are identical, and \(\sum _{i \in I} \text {P}_Y \pi ^{(\ell -1)}_i = \nu = \sum _{i \in I} \text {P}_Y \pi ^*_i\) so that one has

$$\begin{aligned} \sum _{i \in I} \left\langle \text {P}_Y \pi ^*_i, \log v^*_i \right\rangle =\sum _{i \in I} \left\langle \text {P}_Y \pi ^*_i, \log v^*\right\rangle =\sum _{i \in I} \left\langle \text {P}_Y \pi ^{(\ell -1)}_i, \log v^*\right\rangle =\sum _{i \in I} \left\langle \text {P}_Y \pi ^{(\ell -1)}_i, \log v^*_i \right\rangle . \end{aligned}$$

\(\square \)

Lemma 4

(Primal-dual gap) Let

$$\begin{aligned} {\tilde{\pi }} = \sum _{i \in I} {\tilde{\pi }}_i, \quad {\tilde{\pi }}_i = ({\tilde{u}}_i \otimes {\tilde{v}}_i) \cdot K_i, \quad {\tilde{u}}_i \in L^\infty _+(X,\mu ), \quad {\tilde{v}}_i \in L^\infty _+(Y,\nu ) \quad \text {for} \quad i \in I \end{aligned}$$

and \({\hat{\pi }} = ({\hat{u}} \otimes {\hat{v}}) \cdot K\) with \({\hat{u}} \in L^\infty _+(X,\mu )\), \({\hat{v}} \in L^\infty _+(Y,\nu )\). If \({\tilde{\pi }} \in \Pi (\mu , \nu )\), then

$$\begin{aligned} \varDelta ({\tilde{\pi }}) \le {{\,\mathrm{KL}\,}}({\tilde{\pi }}|{\hat{\pi }}). \end{aligned}$$

Proof

If \({{\,\mathrm{KL}\,}}({\tilde{\pi }}|{\hat{\pi }})=+\infty \) there is nothing to prove. Therefore, assume \({{\,\mathrm{KL}\,}}({\tilde{\pi }}|{\hat{\pi }})<+\infty \). Abbreviating \({\hat{\pi }}_i:={\hat{\pi }} {{\llcorner }}(X_i \times Y)\) for \(i \in I\), we find

$$\begin{aligned} {{\,\mathrm{KL}\,}}({\tilde{\pi }}|{\hat{\pi }})&= \sum _{i \in I} {{\,\mathrm{KL}\,}}({\tilde{\pi }}_i|{\hat{\pi }}_i) = \sum _{i\in I} \int _{X \times Y} \varphi \left( \tfrac{{\tilde{u}}_i \otimes {\tilde{v}}_i}{{\hat{u}} \otimes {\hat{v}}}\right) \,\text {d}{\hat{\pi }}_i \\&= \sum _{i \in I} \left[ \int _{X \times Y} \log \left( {\tilde{u}}_i \otimes {\tilde{v}}_i\right) \,\text {d}{\tilde{\pi }}_i - \int _{X \times Y} \log \left( {\hat{u}} \otimes {\hat{v}}\right) \,\text {d}{\tilde{\pi }}_i - \Vert {\tilde{\pi }}_i\Vert + \Vert {\hat{\pi }}_i\Vert \right] \\&= {{\,\mathrm{KL}\,}}({\tilde{\pi }}|K) - \int _X \log {\hat{u}}\,\text {d}\mu - \int _Y \log {\hat{v}}\,\text {d}\nu + \Vert {\hat{\pi }}\Vert - \Vert K\Vert \\&= {{\,\mathrm{KL}\,}}({\tilde{\pi }}|K) - J({\hat{u}},{\hat{v}}) \ge \varDelta ({\tilde{\pi }}). \end{aligned}$$

Here the integral in the first line is finite since \({{\,\mathrm{KL}\,}}({\tilde{\pi }}_i|{\hat{\pi }}_i)\le {{\,\mathrm{KL}\,}}({\tilde{\pi }}|{\hat{\pi }})<\infty \). The first integral in the second line is finite since \(\log \left( {\tilde{u}}_i \otimes {\tilde{v}}_i\right) \in L^1(X \times Y,{\tilde{\pi }}_i)\), arguing as in Proposition 1 (iv), and therefore so is the second. Since all scaling factors are essentially bounded, we can split the products within the logarithms into separate integrals in the third line where we also used \({\tilde{\pi }} \in \Pi (\mu ,\nu )\). In the fourth line we use the definition of J, (2.9). The last inequality is due to Proposition 2, since for the primal minimizer \(\pi ^*\) we find \({{\,\mathrm{KL}\,}}(\pi ^*|K) \ge J({\hat{u}},{\hat{v}})\) and the primal-dual gap is thus a bound for the suboptimality (4.1). \(\square \)

Proof

(Theorem 1) Assume first that \(\ell >1\) is odd, and so the current composite partition is \(\mathcal {J}_A=\{\{1,2\},\{3\}\}\). On one hand, using Lemma 3, one finds

$$\begin{aligned} \varDelta (\pi ^{(\ell -1)})-\varDelta (\pi ^{(\ell )}) = {{\,\mathrm{KL}\,}}(\pi ^{(\ell -1)}|\pi ^{(\ell )}) \ge {{\,\mathrm{KL}\,}}(\pi ^{(\ell -1)}_2|\pi ^{(\ell )}_2). \end{aligned}$$
(4.6)

On the other hand, with Lemma 4, one can bound \(\varDelta (\pi ^{(\ell )}) \le {{\,\mathrm{KL}\,}}(\pi ^{(\ell )}|{\hat{\pi }})\) for a suitable \({\hat{\pi }}\) of the form \({\hat{\pi }}=({\hat{u}} \otimes {\hat{v}}) \cdot K\). Using a piecewise definition on the partition cells for \({\hat{u}}\), we set here

$$\begin{aligned} {\hat{u}} :=u^{(\ell )}_{\{1,2\}} + q \cdot u^{(\ell -1)}_{\{3\}} \qquad \text {and} \qquad {\hat{v}} :=v^{(\ell )}_{\{1,2\}} \end{aligned}$$
(4.7)

where the factor \(q \in \mathbb {R}_{++}\) is to be determined later (observe that \({\hat{u}}\) and \({\hat{v}}\) are upper bounded, thanks to Lemma 1). With this choice of \({\hat{\pi }}\) we find that \(\pi ^{(\ell )}_{\{1,2\}}={\hat{\pi }}_{\{1,2\}}\) and thus

$$\begin{aligned} \varDelta (\pi ^{(\ell )}) \le {{\,\mathrm{KL}\,}}(\pi ^{(\ell )}|{\hat{\pi }}) = {{\,\mathrm{KL}\,}}(\pi ^{(\ell )}_3|{\hat{\pi }}_3)\,. \end{aligned}$$
(4.8)

To complete the proof we must now find a constant \(C \in \mathbb {R}_+\) such that

$$\begin{aligned} {{\,\mathrm{KL}\,}}(\pi ^{(\ell )}_3|{\hat{\pi }}_3) \le C \cdot {{\,\mathrm{KL}\,}}(\pi ^{(\ell -1)}_2|\pi ^{(\ell )}_2)\,. \end{aligned}$$

To this end we write

$$\begin{aligned}&{{\,\mathrm{KL}\,}}(\pi ^{(\ell -1)}_2|\pi ^{(\ell )}_2) = \int _{X \times Y} \varphi \left( \frac{\text {d}\pi ^{(\ell -1)}_2}{\text {d}\pi ^{(\ell )}_2}\right) \, \text {d}\pi ^{(\ell )}_2 \nonumber \\&= \int _{X \times Y} \varphi \left( \frac{u^{(\ell -1)}_2(x)\,v^{(\ell -1)}_2(y)}{u^{(\ell )}_2(x)\,v^{(\ell )}_2(y)}\right) \, \frac{u^{(\ell )}_2(x)}{u^{(\ell -1)}_2(x)}u^{(\ell -1)}_2(x)\, v^{(\ell )}_2(y)\,k(x,y)\,\text {d}\mu _2(x)\,\text {d}\nu (y). \end{aligned}$$
(4.9)

Now apply (3.3) at iterate \((\ell -1)\) and for \(J = \{2,3\}\) to obtain that for \(\mu \)-a.e. \(x \in X_2, x_3 \in X_3\) and for \(\nu \)-a.e. \(y \in Y\) it holds

$$\begin{aligned} u^{(\ell -1)}_2(x)\,k(x,y)&\ge u^{(\ell -1)}_3(x_3)\,k(x_3,y)\,\cdot \exp \left( -2\Vert c\Vert /\varepsilon \right) . \end{aligned}$$
(4.10)

Further, note that for \(a \ge 0\) the map \(\Phi : s \mapsto \big (\varphi \big (\tfrac{a}{s}\big ) \cdot s\big )\) is convex on \(\mathbb {R}_{++}\) and thus by Jensen’s inequality one has for \(\nu ^{(\ell )}_2\)-a.e. \(y \in Y\) (so that in particular \(v^{(\ell )}_2(y)>0\)) that

$$\begin{aligned} \int _{X} \varphi \left( \frac{u^{(\ell -1)}_2(x)\,v^{(\ell -1)}_2(y)}{u^{(\ell )}_2(x)\,v^{(\ell )}_2(y)}\right) \, \frac{u^{(\ell )}_2(x)}{u^{(\ell -1)}_2(x)}\,\text {d}\mu _2(x) \ge \Vert \mu _2\Vert \cdot \varphi \left( \frac{v^{(\ell -1)}_2(y)}{q\,v^{(\ell )}_2(y)}\right) \,q \end{aligned}$$
(4.11)

where we now fix the value \(q :=\fint _X \frac{u^{(\ell )}_2}{u^{(\ell -1)}_2}\,\text {d}\mu _2\) (which is finite because of the lower and upper boundedness of \(u^{(\ell )}_2\) and \(u^{(\ell -1)}_2\), cf. Lemma 1 (i), (iii)). We recall the notation for the normalized integral from (2.1). Plugging first (4.10) and then (4.11) into (4.9) one obtains for \(\mu \)-almost all \(x_3 \in X_3\)

$$\begin{aligned} {{\,\mathrm{KL}\,}}(\pi ^{(\ell -1)}_2|\pi ^{(\ell )}_2) \ge e^{-\frac{2\Vert c\Vert }{\varepsilon }} \cdot \Vert \mu _2\Vert \cdot \int _{Y} \varphi \left( \frac{v^{(\ell -1)}_2(y)}{q\,v^{(\ell )}_2(y)}\right) \,q\,u^{(\ell -1)}_3(x_3)\,v^{(\ell )}_2(y)\,k(x_3,y)\,\text {d}\nu (y). \end{aligned}$$
(4.12)

Now we work backwards from (4.8). With the specified choices for \({\hat{u}}\) and \({\hat{v}}\) made in (4.7), one finds for \({\hat{\pi }}\)-a.e. \((x,y) \in X_3 \times Y\)

$$\begin{aligned} \frac{\text {d}\pi ^{(\ell )}_3}{\text {d}{\hat{\pi }}}(x,y)&= \frac{u^{(\ell )}_3(x) \cdot v^{(\ell )}_3(y)}{q \cdot u^{(\ell -1)}_3(x) \cdot v^{(\ell )}_{\{1,2\}}(y)} = \frac{u^{(\ell -1)}_3(x) \cdot v^{(\ell -1)}_3(y)}{q \cdot u^{(\ell -1)}_3(x) \cdot v^{(\ell )}_2(y)} = \frac{v^{(\ell -1)}_2(y)}{q \cdot v^{(\ell )}_2(y)} \end{aligned}$$

where we have used that \(u^{(\ell )}_3 \otimes v^{(\ell )}_3= u^{(\ell -1)}_3 \otimes v^{(\ell -1)}_3\) (since \(\pi ^{(\ell )}_3=\pi ^{(\ell -1)}_3\) due to \(\{3\} \in \mathcal {J}_A\)) and \(v^{(\ell -1)}_3=v^{(\ell -1)}_2\) (since \(\{2,3\} \in \mathcal {J}_B\)). Consequently, with (4.12)

$$\begin{aligned} {{\,\mathrm{KL}\,}}(\pi ^{(\ell )}_3|{\hat{\pi }}_3)&= \int _{X \times Y} \varphi \left( \frac{v^{(\ell -1)}_2(y)}{q \cdot v^{(\ell )}_2(y)}\right) q\, u^{(\ell -1)}_3(x)\,v^{(\ell )}_2(y)\,k(x,y)\,\text {d}\mu _3(x) \text {d}\nu (y) \\&\le e^{\frac{2\Vert c\Vert }{\varepsilon }} \cdot \frac{1}{\Vert \mu _2\Vert } \int _{X} \text {d}\mu _3(x) \,{{\,\mathrm{KL}\,}}(\pi ^{(\ell -1)}_2|\pi ^{(\ell )}_2) \end{aligned}$$

and with (4.6) and (4.8)

$$\begin{aligned} \varDelta (\pi ^{(\ell )})&\le \exp \left( \tfrac{2\Vert c\Vert }{\varepsilon }\right) \cdot \frac{\Vert \mu _3\Vert }{\Vert \mu _2\Vert } \left( \varDelta (\pi ^{(\ell -1)})-\varDelta (\pi ^{(\ell )})\right) . \end{aligned}$$

For \(\ell \) even one proceeds analogously to obtain

$$\begin{aligned} \varDelta (\pi ^{(\ell )})&\le \exp \left( \tfrac{2\Vert c\Vert }{\varepsilon }\right) \cdot \frac{\Vert \mu _1\Vert }{\Vert \mu _2\Vert } \left( \varDelta (\pi ^{(\ell -1)})-\varDelta (\pi ^{(\ell )})\right) . \end{aligned}$$

Hence, for any \(\ell > 1\), using \(\Vert \mu _{\{1,3\}}\Vert \ge \max \{\Vert \mu _1\Vert ,\Vert \mu _3\Vert \}\), one obtains

$$\begin{aligned} \varDelta (\pi ^{(\ell )}) \le \exp \left( \tfrac{2\Vert c\Vert }{\varepsilon }\right) \tfrac{\Vert \mu _{\{1,3\}}\Vert }{\Vert \mu _2\Vert } \cdot \big (\varDelta (\pi ^{(\ell -1)})-\varDelta (\pi ^{(\ell )})\big ) \end{aligned}$$

which implies (4.2). Finally, the sequence \((\pi ^{(\ell )})_\ell \) is bounded and each element lies in \(\Pi (\mu ,\nu )\). As \(\Pi (\mu ,\nu )\) is weakly-\(*\) closed, \((\pi ^{(\ell )})_\ell \) must have a cluster point \(\pi ^{(\infty )} \in \Pi (\mu ,\nu )\). By (4.2) and since \(\varDelta \) is weakly-\(*\) lower semi-continuous one has

$$\begin{aligned} \varDelta (\pi ^{(\infty )}) \le \liminf _{\ell \rightarrow \infty } \varDelta (\pi ^{(\ell )})=0\,. \end{aligned}$$

Since \([\varDelta (\pi )={{\,\mathrm{KL}\,}}(\pi |\pi ^*)=0]\) \(\Leftrightarrow \) \([\pi =\pi ^*]\) one has \(\pi ^{(\infty )} = \pi ^*\). This also implies that every cluster point of \((\pi ^{(\ell )})_\ell \) must equal \(\pi ^*\) and thus that the sequence converges. \(\square \)

4.2 General decompositions

Now let I be a basic partition of \((X,\mu )\) into N basic cells and let \(\mathcal {J}_A\) and \(\mathcal {J}_B\) be two composite partitions of I. In the special case of three cells discussed in the previous Section, convergence of the algorithm was driven by the overlap of the two partitions on the middle cell. In the general case this overlap structure will be captured by the partition graph and related auxiliary objects that we introduce now.

Definition 7

(Partition graph, cell distance and shortest paths)

(i)

    The partition graph is given by the vertex set I (i.e. each basic cell is represented by one vertex) and the edge set

    $$\begin{aligned} {E :=\left\{ (i,j) \in I \times I \,\big \vert \, \exists \, J \in \mathcal {J}_A\cup \mathcal {J}_B\text { such that } \{i,j\} \subset J\right\} ,} \end{aligned}$$

    i.e. there is an edge between two basic cells if they are part of the same composite cell in either of the two composite partitions \(\mathcal {J}_A\) or \(\mathcal {J}_B\).

(ii)

    We assume that no two basic cells are simultaneously contained in a composite cell of \(\mathcal {J}_A\) and \(\mathcal {J}_B\), i.e. there exist no \(i,j \in I\), \(J_A \in \mathcal {J}_A\), \(J_B \in \mathcal {J}_B\) such that \(i,j \in J_A\) and \(i,j \in J_B\) (otherwise i and j should be merged into one basic cell). This means that every edge in the partition graph is associated to precisely one of the two composite partitions.

(iii)

    We assume that the partition graph is connected.

(iv)

    We denote by \({{\,\mathrm{dist}\,}}: I \times I \rightarrow \mathbb {N}_0\) the discrete metric on this graph induced by shortest paths, where each edge has length 1.

(v)

    Let \(J_0 \in \mathcal {J}_A\) be a selected composite cell. We define the cell distance

    $$\begin{aligned} D : I&\rightarrow \mathbb {N}_0,\quad i \mapsto {{\,\mathrm{dist}\,}}(i,J_0) = \min \{ {{\,\mathrm{dist}\,}}(i,j) \,|\, j \in J_0\} \end{aligned}$$

    and write \(M :=\max \{D(i) \,|\, i \in I\}\).

(vi)

    For each \(i \in I\) we select one shortest path in the graph from \(J_0\) to i, which we represent by a tuple \((n_{i,k})_{k=0}^{D(i)}\) of elements in I with \(n_{i,0} \in J_0\), \(n_{i,D(i)}=i\), and \(D(n_{i,k})=k\) for \(k \in \{0,\ldots ,D(i)\}\). We refer to Fig. 2 for an illustration of such a construction. One can easily verify that, within any shortest path, the basic cells \(n_{i,k}\) and \(n_{i,k-1}\) belong to a common composite cell of partition \(\mathcal {J}_A\) when k is even, and \(\mathcal {J}_B\) when k is odd, or equivalently, of \(\mathcal {J}^{(\ell -k)}\) for some odd \(\ell > M\), see Fig. 3.
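The cell distance D and one shortest path per basic cell can be computed by a breadth-first search from \(J_0\); the following sketch (our own, with composite partitions represented as lists of lists of basic-cell indices) is one way to do this. The value M is then max(D.values()), and the tuple \((n_{i,k})_{k=0}^{D(i)}\) is recovered by following the returned predecessors from i back to \(J_0\).

```python
from collections import deque

def cell_distances(J_A, J_B, J0):
    """Partition graph of Definition 7: vertices are basic cells, with an edge
    between i and j whenever they share a composite cell of J_A or J_B.
    Returns D(i) = dist(i, J0) and a BFS predecessor map encoding, for each i,
    one shortest path from J0 to i."""
    neighbours = {}
    for J in list(J_A) + list(J_B):
        for i in J:
            neighbours.setdefault(i, set()).update(j for j in J if j != i)
    D = {i: 0 for i in J0}
    parent = {i: None for i in J0}
    queue = deque(J0)
    while queue:
        i = queue.popleft()
        for j in neighbours[i]:
            if j not in D:
                D[j], parent[j] = D[i] + 1, i
                queue.append(j)
    return D, parent
```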

Let us fix for the rest of this Section an odd iterate \(\ell \) so that \(\mathcal {J}^{(\ell )}=\mathcal {J}_A\) (the same will then apply for \(\ell \) even by swapping the roles of \(\mathcal {J}_A\) and \(\mathcal {J}_B\)).
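
The construction in Definition 7 is purely combinatorial. For illustration, the following minimal Python sketch (function and variable names are ours) builds the partition graph and computes the cell distance D by a multi-source breadth-first search started from \(J_0\); M is then \(\max \{D(i)\,|\,i \in I\}\), and storing a predecessor for each visited cell would also yield the selected shortest paths of item (vi).

```python
from collections import deque

def cell_distances(I, JA, JB, J0):
    """Partition graph of Definition 7 and cell distance D(i) = dist(i, J_0).

    I      : list of basic cells (hashable labels)
    JA, JB : composite partitions, given as lists of lists of basic cells
    J0     : the selected root composite cell (an element of JA)
    """
    # edge between i and j iff they share a composite cell of JA or JB
    adjacency = {i: set() for i in I}
    for J in list(JA) + list(JB):
        for i in J:
            for j in J:
                if i != j:
                    adjacency[i].add(j)
    # multi-source breadth-first search from all basic cells of J_0
    D = {i: 0 for i in J0}
    queue = deque(J0)
    while queue:
        i = queue.popleft()
        for j in adjacency[i]:
            if j not in D:
                D[j] = D[i] + 1
                queue.append(j)
    return D
```

For the three-cell setting of the previous Section (I = [1, 2, 3], JA = [[1, 2], [3]], JB = [[1], [2, 3]], J0 = [1, 2]) this returns D = {1: 0, 2: 0, 3: 1}, hence M = 1; connectedness of the partition graph, assumption (iii), corresponds to the search reaching every basic cell.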

Fig. 2

Illustration of partition graph and cell distance for a typical 2D setup. Vertices, the elements of I, are represented by circles, with cell distance D(i) given as labels. Composite cells of \(\mathcal {J}_A\) are indicated by rectangles with solid lines, \(J_0\) is highlighted with grey filling. Composite cells of \(\mathcal {J}_B\) are indicated by rectangles with dashed lines. Shortest paths from vertices to \(J_0\) are indicated by black lines (these are often not unique). For the vertex highlighted in grey the indices \((n_{i,k})_{k=0}^{D(i)}\) are given below the corresponding nodes

Fig. 3

Illustration of proof strategies for Lemma 5 and Theorem 2. The horizontal axis represents basic cells along the shortest path \((n_{i,k})_{k=0}^{D(i)}\) from \(J_0 \in \mathcal {J}_A\) (left side) to some \(i \in I\) (right side), the vertical axis represents iterations for some odd \(\ell >M\) (i.e. \(\mathcal {J}^{(\ell )}=\mathcal {J}_A\)). Solid vertical lines represent boundaries of composite cells at a given iteration. Grey shading of boxes illustrates the construction of \({\tilde{\pi }}\) in (4.14). Vertical arrows represent factors \(q_{i,k}\), cf. (4.17). Horizontal arrows represent edges in the partition graph and the equalities \(v_{n_{i,k}}^{(\ell -k)}=v_{n_{i,k-1}}^{(\ell -k)}\) used in (4.20)

Theorem 2

(Linear convergence) Let \(\ell > M\), with M defined in Definition 7 (v). Then, one finds

$$\begin{aligned} \varDelta (\pi ^{(\ell )}) \le \frac{2MN\exp ((6M+7)\,\Vert c\Vert /\varepsilon )}{\mu _{\text {min}}^{2M+1}+2MN\exp ((6M+7)\,\Vert c\Vert /\varepsilon )} \cdot \varDelta (\pi ^{(\ell -M)}). \end{aligned}$$
(4.13)

where \(\mu _{\text {min}}:=\min \{\Vert \mu _i\Vert | i \in I\}\). In particular, the domain decomposition algorithm converges to the optimal solution.

Remark 3

(Proof strategy) For the three-cell case the proof relied on the fact that \(\pi ^{(\ell )}_3\) remained unchanged in an A-iteration, since \(\{3\} \in \mathcal {J}_A\) was a composite cell containing only basic cell 3. This is no longer true for general decompositions. To overcome this, we need to replace parts of the primal iterate \(\pi ^{(\ell )}\) by partial transport plans from previous iterations (Lemma 5). When bounding the sub-optimality by the decrement (cf. Remark 2) we also need an approximate triangle inequality for the \({{\,\mathrm{KL}\,}}\)-divergence (Lemma 7).

Lemma 5

(Construction of primal upper bound candidate) For \(\ell > M\), \(\ell \) odd, the measure

$$\begin{aligned} {\tilde{\pi }}&= \sum _{i \in I} \pi _i^{(\ell -D(i))} \end{aligned}$$
(4.14)

satisfies \({\tilde{\pi }} \in \Pi (\mu ,\nu )\) and \({{\,\mathrm{KL}\,}}({\tilde{\pi }}|\pi ^*) \ge {{\,\mathrm{KL}\,}}(\pi ^{(\ell )}|\pi ^*)\).

Proof

In the proof we assume that \(\ell \) is odd (i.e. an A-iteration was just completed). The proof for \(\ell \) even works completely analogously by swapping the roles of \(\mathcal {J}_A\) and \(\mathcal {J}_B\).

We start with the claim that \({\tilde{\pi }} \in \Pi (\mu ,\nu )\). Since \({\tilde{\pi }}\) is a sum of non-negative measures it is non-negative. By construction of the algorithm one always has \(\text {P}_X \pi _i^{(m)}=\mu _i\) for all iterations \(m \in \mathbb {N}_0\) and thus

$$\begin{aligned} \text {P}_X {\tilde{\pi }} = \sum _{i \in I} \text {P}_X \pi _i^{(\ell -D(i))} = \sum _{i \in I} \mu _i=\mu \,. \end{aligned}$$

For \(k \in \{0,\ldots ,M\}\) we introduce the auxiliary measures

$$\begin{aligned} {\tilde{\pi }}^k&= \sum _{i \in I} \pi _i^{(\ell - \min \{D(i),k\})}\,. \end{aligned}$$

Note that \({\tilde{\pi }}^0=\pi ^{(\ell )}\) and \({\tilde{\pi }}^M={\tilde{\pi }}\). Further, one obtains for \(k \in \{1,\ldots ,M\}\)

$$\begin{aligned} {\tilde{\pi }}_i^k - {\tilde{\pi }}_i^{k-1}&= {\left\{ \begin{array}{ll} \pi _i^{(\ell -k)}-\pi _i^{(\ell -k+1)} &{} \text {if } D(i) \ge k,\\ 0 &{} \text {else.} \end{array}\right. } \\ {\tilde{\pi }}^k-{\tilde{\pi }}^{k-1}&= \sum _{\begin{array}{c} i \in I:\\ D(i)\ge k \end{array}} \left( \pi _i^{(\ell -k)}-\pi _i^{(\ell -k+1)}\right) \end{aligned}$$

Assume now that k is odd. Recalling that \(\ell \) is assumed to be odd, the set \(J^k :=\{i \in I\,:\,D(i) \ge k\}\) is a union of A-cells and the iteration step from \(\pi ^{(\ell -k)}\) to \(\pi ^{(\ell -k+1)}\) is an iteration on A-cells (since \(\ell -k+1\) is odd, recall also Definition 7), i.e., for any \(J \in \mathcal {J}_A\) one has

$$\begin{aligned} \text {P}_Y \pi _J^{(\ell -k)} = \text {P}_Y \pi _J^{(\ell -k+1)}. \end{aligned}$$

Since \(J^k\) is a union of A-cells we find therefore

$$\begin{aligned} \text {P}_Y \sum _{i \in J^k} \pi _i^{(\ell -k)} = \text {P}_Y \sum _{i \in J^k} \pi _i^{(\ell -k+1)}\,. \end{aligned}$$

Using this, we find that \(\text {P}_Y {\tilde{\pi }}^k = \text {P}_Y {\tilde{\pi }}^{k-1}\) for \(k \in \{1,\ldots ,M\}\) and k odd, and we can argue in complete analogy for even k via B-cells. Consequently,

$$\begin{aligned} \text {P}_Y {\tilde{\pi }} = \text {P}_Y {\tilde{\pi }}^M = \text {P}_Y {\tilde{\pi }}^{M-1} = \ldots = \text {P}_Y {\tilde{\pi }}^0 = \text {P}_Y \pi ^{(\ell )} = \nu \end{aligned}$$

and thus \({\tilde{\pi }} \in \Pi (\mu ,\nu )\).

Now we establish the \({{\,\mathrm{KL}\,}}\) bound. Again, assume that k is odd, so that \(J^k\) is a union of A-cells and the iteration step from \(\pi ^{(\ell -k)}\) to \(\pi ^{(\ell -k+1)}\) is an iteration on A-cells. Then for any \(J \in \mathcal {J}_A\) one finds

$$\begin{aligned} {{\,\mathrm{KL}\,}}(\pi _J^{(\ell -k)}|\pi ^*) \ge {{\,\mathrm{KL}\,}}(\pi _J^{(\ell -k+1)}|\pi ^*) \end{aligned}$$

and since \(J^k\) is a union of A-cells one gets that

$$\begin{aligned} {{\,\mathrm{KL}\,}}({\tilde{\pi }}^k|\pi ^*)-{{\,\mathrm{KL}\,}}({\tilde{\pi }}^{k-1}|\pi ^*)&= \sum _{i \in I} \left[ {{\,\mathrm{KL}\,}}({\tilde{\pi }}_i^k|\pi _i^*)-{{\,\mathrm{KL}\,}}({\tilde{\pi }}_i^{k-1}|\pi _i^*) \right] \\&= \sum _{i \in J^k} \left[ {{\,\mathrm{KL}\,}}(\pi _i^{(\ell -k)}|\pi _i^*)-{{\,\mathrm{KL}\,}}(\pi _i^{(\ell -k+1)}|\pi _i^*) \right] \ge 0. \end{aligned}$$

Again, an analogous argument holds when k is even. Consequently one has

$$\begin{aligned} {{\,\mathrm{KL}\,}}({\tilde{\pi }}|\pi ^*) = {{\,\mathrm{KL}\,}}({\tilde{\pi }}^M|\pi ^*) \ge {{\,\mathrm{KL}\,}}({\tilde{\pi }}^{M-1}|\pi ^*) \ge \ldots \ge {{\,\mathrm{KL}\,}}({\tilde{\pi }}^0|\pi ^*) = {{\,\mathrm{KL}\,}}(\pi ^{(\ell )}|\pi ^*).&\end{aligned}$$

\(\square \)

Lemma 6

(Global density bound) Let \(\ell > M\). Then for \(\nu \)-a.e. \(y \in Y\) and all \(i \in I\) one has

$$\begin{aligned} \tfrac{\text {d}\nu ^{(\ell )}_i}{\text {d}\nu }(y) \ge \frac{\mu _{\text {min}}^{M+1}}{N}\exp \left( -2\,(M+1)\,\Vert c\Vert /\varepsilon \right) . \end{aligned}$$
(4.15)

Proof

Let \(J \in \mathcal {J}^{(\ell )}\) such that \(i \in J\). We apply (3.6) for each \(j \in J\) to obtain

$$\begin{aligned} \tfrac{\text {d}\nu ^{(\ell )}_i}{\text {d}\nu }(y) \ge \exp (-2\Vert c\Vert /\varepsilon )\,\mu _{\text {min}}\cdot \max _{j \in J} \tfrac{\text {d}\nu ^{(\ell -1)}_{j}}{\text {d}\nu }(y)\,. \end{aligned}$$

for \(\nu \)-a.e. \(y \in Y\). Applying this bound recursively, the \(\max \) runs over increasingly many cells and after \(M+1\) applications of the bound we find:

$$\begin{aligned} \tfrac{\text {d}\nu ^{(\ell )}_i}{\text {d}\nu }(y) \ge \exp (-2\,(M+1)\,\Vert c\Vert /\varepsilon )\,\mu _{\text {min}}^{M+1} \cdot \max _{j \in I} \tfrac{\text {d}\nu ^{(\ell -M-1)}_{j}}{\text {d}\nu }(y)\,. \end{aligned}$$

Observe now that \(\sum _{j \in I} \tfrac{\text {d}\nu ^{(\ell -M-1)}_{j}}{\text {d}\nu }(y)=1\) and thus the maximum must be bounded from below by 1/N. \(\square \)

Proof

(Theorem 2) Let us set

$$\begin{aligned} {\hat{\pi }} :=({\hat{u}} \otimes {\hat{v}}) \cdot K\quad \text {with} \quad {\hat{u}}_i :=\left( \prod _{k=0}^{D(i)-1} q_{i,k}\right) \cdot u_i^{(\ell -D(i))} \quad \text {and} \quad {\hat{v}}&:=v_{J_0}^{(\ell )} \end{aligned}$$
(4.16)

where the factors \(q_{i,k}\) are defined as

$$\begin{aligned} q_{i,k} :=\fint _X \frac{u^{(\ell -k)}_{n_{i,k}}}{u^{(\ell -k-1)}_{n_{i,k}}} \,\text {d}\mu _{n_{i,k}} \end{aligned}$$
(4.17)

and are finite thanks to the lower and upper boundedness of the u-scalings, cf. Lemma 1 (i), (iii). Thus, with Lemmas 3 and 4 and \({\tilde{\pi }}\) from Lemma 5, one obtains

$$\begin{aligned} \varDelta (\pi ^{(\ell )}) = {{\,\mathrm{KL}\,}}(\pi ^{(\ell )}|\pi ^*)&\le {{\,\mathrm{KL}\,}}({\tilde{\pi }}|\pi ^*) \le {{\,\mathrm{KL}\,}}({\tilde{\pi }}|{\hat{\pi }}) = \sum _{i \in I} {{\,\mathrm{KL}\,}}({\tilde{\pi }}_i|{\hat{\pi }}_i) \end{aligned}$$
(4.18)

where one has

$$\begin{aligned} {{\,\mathrm{KL}\,}}({\tilde{\pi }}_i|{\hat{\pi }}_i)&= \int _{X_i \times Y} \varphi \left( \tfrac{\text {d}{\tilde{\pi }}_i}{\text {d}{\hat{\pi }}_i}\right) \,\text {d}{\hat{\pi }}_i\,. \end{aligned}$$

For \(i \in J_0\) one finds \({\hat{\pi }}_i\)-almost everywhere that

$$\begin{aligned} \frac{\text {d}{\tilde{\pi }}_i}{\text {d}{\hat{\pi }}_i} = \frac{{\tilde{u}}_i \otimes {\tilde{v}}_i}{{\hat{u}}_i \otimes {\hat{v}}} = \frac{u_i^{(\ell )} \otimes v_i^{(\ell )}}{u_i^{(\ell )} \otimes v^{(\ell )}_{J_0}}=1 \qquad \text { and thus } \qquad {{\,\mathrm{KL}\,}}({\tilde{\pi }}_i|{\hat{\pi }}_i)=0 \end{aligned}$$

where we have used that \({\hat{u}}_i=u^{(\ell )}_i\) and \(v^{(\ell )}_i=v^{(\ell )}_{J_0}\) when \(i \in J_0\) (which implies \(D(i)=0\)).

To control \({{\,\mathrm{KL}\,}}({\tilde{\pi }}_i|{\hat{\pi }}_i)\) for \(i \in I\) with \(D(i)>0\) we will use the following bound for \(k \in \{0,\ldots ,D(i)-1\}\):

$$\begin{aligned}&{{\,\mathrm{KL}\,}}(\pi ^{(\ell -k-1)}_{n_{i,k}}|\pi ^{(\ell -k)}_{n_{i,k}}) \nonumber \\&\quad = \int _{X \times Y} \varphi \left( \tfrac{u^{(\ell -k-1)}_{n_{i,k}}(x) \cdot v^{(\ell -k-1)}_{n_{i,k}}(y)}{u^{(\ell -k)}_{n_{i,k}}(x) \cdot v^{(\ell -k)}_{n_{i,k}}(y)} \right) u^{(\ell -k)}_{n_{i,k}}(x) \cdot v^{(\ell -k)}_{n_{i,k}}(y) \, k(x,y) \, \text {d}\mu _{n_{i,k}}(x) \text {d}\nu (y) \nonumber \\&\quad \ge e^{-\frac{2\Vert c\Vert }{\varepsilon }} \cdot \int _{X \times Y} \varphi \left( \tfrac{u^{(\ell -k-1)}_{n_{i,k}}(x) \cdot v^{(\ell -k-1)}_{n_{i,k}}(y)}{u^{(\ell -k)}_{n_{i,k}}(x) \cdot v^{(\ell -k)}_{n_{i,k}}(y)} \right) \tfrac{u^{(\ell -k)}_{n_{i,k}}(x)}{u^{(\ell -k-1)}_{n_{i,k}}(x)}u^{(\ell -k-1)}_{n_{i,k}}({\hat{x}}) \cdot v^{(\ell -k)}_{n_{i,k}}(y) \, \text {d}\mu _{n_{i,k}}(x) \text {d}\nu (y) \nonumber \\&\quad = e^{-\frac{2\Vert c\Vert }{\varepsilon }} \cdot \int _{Y} \left[ \int _X \varphi \left( \tfrac{u^{(\ell -k-1)}_{n_{i,k}}(x) \cdot v^{(\ell -k-1)}_{n_{i,k}}(y)}{u^{(\ell -k)}_{n_{i,k}}(x) \cdot v^{(\ell -k)}_{n_{i,k}}(y)} \right) \tfrac{u^{(\ell -k)}_{n_{i,k}}(x)}{u^{(\ell -k-1)}_{n_{i,k}}(x)} \, \text {d}\mu _{n_{i,k}}(x)\right] u^{(\ell -k-1)}_{n_{i,k}}({\hat{x}}) \cdot v^{(\ell -k)}_{n_{i,k}}(y) \,\text {d}\nu (y) \nonumber \\&\quad \ge e^{-\frac{2\Vert c\Vert }{\varepsilon }}\,\Vert \mu _{n_{i,k}}\Vert \int _Y \varphi \left( \tfrac{v^{(\ell -k-1)}_{n_{i,k}}(y)}{q_{i,k}\, v^{(\ell -k)}_{n_{i,k}}(y)} \right) q_{i,k}\,u^{(\ell -k-1)}_{n_{i,k}}({\hat{x}})\, v^{(\ell -k)}_{n_{i,k}}(y) \, \text {d}\nu (y) \end{aligned}$$
(4.19)

for \(\mu _{n_{i,k}}\)-a.e. \({\hat{x}}\in X_{n_{i,k}}\), where in the first inequality we have used (3.3) and the boundedness of c, and in the second inequality we have used Jensen’s inequality as in (4.11) in the proof of Theorem 1.

Now note that one has \(v_{n_{i,k}}^{(\ell -k)}=v_{n_{i,k-1}}^{(\ell -k)}\) for \(k \in \{1,\ldots ,D(i)\}\), since \(\{n_{i,k},n_{i,k-1}\} \subset J\) for some \(J \in \mathcal {J}^{(\ell -k)}\), cf. Definition 7 (vi) and Fig. 3. Therefore, for i with \(D(i)>0\) and recalling \(n_{i,D(i)} = i\), one obtains \({\hat{\pi }}_i\)-almost everywhere that

$$\begin{aligned} \frac{\text {d}{\tilde{\pi }}_i}{\text {d}{\hat{\pi }}_i}&= \frac{{\tilde{u}}_i \otimes {\tilde{v}}_i}{{\hat{u}}_i \otimes {\hat{v}}_i} = \frac{u_i^{(\ell -D(i))} \otimes v_i^{(\ell -D(i))}}{ \left( \prod _{k=0}^{D(i)-1} q_{i,k}\right) \cdot u_i^{(\ell -D(i))} \otimes v^{(\ell )}_{J_0}} = \frac{v_{n_{i,D(i)}}^{(\ell -D(i))}}{ \left( \prod _{k=0}^{D(i)-1} q_{i,k}\right) \cdot v^{(\ell )}_{n_{i,0}}} \nonumber \\&= \left( \prod _{k=0}^{D(i)-1} q_{i,k}\right) ^{-1} \cdot \frac{v_{n_{i,D(i)-1}}^{(\ell -D(i))}}{v^{(\ell )}_{n_{i,0}}} \cdot \underbrace{\left( \prod _{k=1}^{D(i)-1} \frac{v_{n_{i,k-1}}^{(\ell -k)}}{v_{n_{i,k}}^{(\ell -k)}} \right) }_{=1} = \prod _{k=0}^{D(i)-1} \frac{v_{n_{i,k}}^{(\ell -k-1)}}{q_{i,k}\,v_{n_{i,k}}^{(\ell -k)}}\,. \end{aligned}$$
(4.20)

So continuing the suboptimality bound from (4.18) we obtain

$$\begin{aligned} \varDelta (\pi ^{(\ell )})&\le \sum _{\begin{array}{c} i \in I:\\ D(i)>0 \end{array}} \int _{X \times Y} \varphi \left( \prod _{k=0}^{D(i)-1} \frac{v_{n_{i,k}}^{(\ell -k-1)}}{q_{i,k}\,v_{n_{i,k}}^{(\ell -k)}} \right) \,\text {d}{\hat{\pi }}_i. \end{aligned}$$

We now transform the ‘\(\varphi \) of a product’ into a ‘sum of \(\varphi \)’ via Lemma 7. For each term we then use several auxiliary results to bound \({\hat{\pi }}_i\) from above by a multiple of the appropriate \(\pi ^{(\ell -k)}_{n_{i,k}}\). Using the definition of \(q_{i,k}\) in (4.17) and (3.5) we find that

$$\begin{aligned} \frac{v_{n_{i,k}}^{(\ell -k-1)}}{q_{i,k}\,v_{n_{i,k}}^{(\ell -k)}} \le \frac{\exp (3\Vert c\Vert /\varepsilon )}{\mu _{\text {min}}} =:L \ge 1 \end{aligned}$$

and so by using Lemma 7 one obtains for

$$\begin{aligned} C_1&:=2\,M \cdot L^{M-1} \ge D(i) \cdot \max \{2,L^{D(i)-1}\} \end{aligned}$$

that

$$\begin{aligned} \varDelta (\pi ^{(\ell )})&\le C_1 \cdot \sum _{\begin{array}{c} i \in I:\\ D(i)>0 \end{array}} \sum _{k=0}^{D(i)-1} \int _{X \times Y} \varphi \left( \tfrac{v_{n_{i,k}}^{(\ell -k-1)}}{q_{i,k}\,v_{n_{i,k}}^{(\ell -k)}} \right) \,\text {d}{\hat{\pi }}_i. \end{aligned}$$
(4.21)

Further, since \(i = n_{i,D(i)}\) and \(\{n_{i,D(i)}, n_{i,D(i)-1}\} \subset J\) for some \(J \in \mathcal {J}^{(\ell -D(i))}\) (cf. Definition 7 (vi)), by using (3.3) one finds that for \(\mu _i\)-a.e. \(x \in X_i\)

$$\begin{aligned} q_{i,D(i)-1}\cdot u^{(\ell -D(i))}_i(x)&= \fint _X \frac{u_{n_{i,D(i)}}^{(\ell -D(i))}(x) \cdot u^{(\ell -(D(i)-1))}_{n_{i,D(i)-1}}({\hat{x}})}{u^{(\ell -D(i))}_{n_{i,D(i)-1}}({\hat{x}})} \,\text {d}\mu _{n_{i,D(i)-1}}({\hat{x}}) \\&\le \exp (\Vert c\Vert /\varepsilon ) \fint _X u^{(\ell -(D(i)-1))}_{n_{i,D(i)-1}}({\hat{x}}) \,\text {d}\mu _{n_{i,D(i)-1}}({\hat{x}}) \end{aligned}$$

and therefore, applying iteratively this bound along the chain connecting \(X_{i}\) to \(X_{n_{i,0}}\), we obtain for \(\mu _i\)-a.e. \(x \in X_i\) and \(\mu _{n_{i,0}}\)-a.e. \(x' \in X_{n_{i,0}}\)

$$\begin{aligned} {\hat{u}}_i(x)&= \left[ \prod _{k=0}^{D(i)-1} q_{i,k}\right] \cdot u_i^{(\ell -D(i))}(x) \le e^{\frac{\Vert c\Vert }{\varepsilon }} \cdot \fint _X \left[ \prod _{k=0}^{D(i)-2} q_{i,k} \right] u^{(\ell -D(i)+1)}_{n_{i,D(i)-1}}({\hat{x}}) \,\text {d}\mu _{n_{i,D(i)-1}}({\hat{x}}) \nonumber \\&\le \dots \le \exp (D(i)\,\Vert c\Vert /\varepsilon ) \cdot \fint _{X} u^{(\ell )}_{n_{i,0}}({\hat{x}})\,\text {d}\mu _{n_{i,0}}({\hat{x}}) \le \exp ((D(i)+1)\,\Vert c\Vert /\varepsilon ) \cdot u^{(\ell )}_{n_{i,0}}(x'). \end{aligned}$$
(4.22)

Combining successively the upper bound in (3.4), (4.15) and the lower bound in (3.4) one deduces for \(({\mu }_{n_{i,0}} \otimes {\mu }_{n_{i,k}})\)-a.e. \((x, x') \in X_{n_{i,0}} \times X_{n_{i,k}}\) and \(\nu \)-a.e. \(y \in Y\) that

$$\begin{aligned} u^{(\ell )}_{n_{i,0}}(x) \cdot v^{(\ell )}_{n_{i,0}}(y)&\le \tfrac{\exp \big (2 \Vert c\Vert /\varepsilon \big )}{\mu _{\text {min}}} \le \tfrac{N\exp \big ((2M+4) \Vert c\Vert /\varepsilon \big )}{\mu _{\text {min}}^{M+2}} \tfrac{\text {d}\nu ^{(\ell -k)}_{n_{i,k}}}{\text {d}\nu }(y) \nonumber \\&\le \tfrac{N\exp \big ((2M+5) \Vert c\Vert /\varepsilon \big )}{\mu _{\text {min}}^{M+2}}\,\cdot \,\Vert \mu _{n_{i,k}}\Vert \cdot u^{(\ell -k)}_{n_{i,k}}(x') \cdot v^{(\ell -k)}_{n_{i,k}}(y) \end{aligned}$$
(4.23)

and we set

$$\begin{aligned} C_2 :=\tfrac{N\exp \big ((2M+5) \Vert c\Vert /\varepsilon \big )}{\mu _{\text {min}}^{M+2}}. \end{aligned}$$

Finally, applying (3.3) again, we can easily see that

$$\begin{aligned} u^{(\ell -k)}_{n_{i,k}}(x) \le e^{\frac{2\,\Vert c\Vert }{\varepsilon }} \cdot q_{i,k} \cdot u^{(\ell -k-1)}_{n_{i,k}}(x') \qquad \text {for}~ \mu _{n_{i,k}}~\text {-a.e.}~x, x' \in X_{n_{i,k}} \end{aligned}$$
(4.24)

We now have all the ingredients to go back to (4.21). For every fixed \(i \in I\) with \({\hat{\pi }}_i = {\hat{u}}_i \otimes {\hat{v}}_i \cdot K_i\) and \(0 \le k \le D(i)-1\), and for \(\mu _{n_{i,0}}\)-a.e. \(x' \in X_{n_{i,0}}\) and \(\mu _{n_{i,k}}\)-a.e. \(x'' \in X_{n_{i,k}}\), we compute

$$\begin{aligned}&\int _{X \times Y} \varphi \left( \tfrac{v_{n_{i,k}}^{(\ell -k-1)}}{q_{i,k}\,v_{n_{i,k}}^{(\ell -k)}} \right) \,\text {d}{\hat{\pi }}_i = \int _{X \times Y} \varphi \left( \tfrac{v_{n_{i,k}}^{(\ell -k-1)}(y)}{q_{i,k}\,v_{n_{i,k}}^{(\ell -k)}(y)} \right) {\hat{u}}_i(x){\hat{v}}_i(y)\underbrace{k(x,y)}_{\le 1} \,\text {d}{\mu }_i(x) \text {d}\nu (y) \\&{}{\mathop {\le }\limits ^{(4.22)}} \Vert \mu _i\Vert e^{\frac{(M+1)\,\Vert c\Vert }{\varepsilon }}\int _{Y} \varphi \left( \tfrac{v_{n_{i,k}}^{(\ell -k-1)}(y)}{q_{i,k}\,v_{n_{i,k}}^{(\ell -k)}(y)} \right) \cdot u^{(\ell )}_{n_{i,0}}(x') \cdot v^{(\ell )}_{n_{i,0}}(y) \, \text {d}\nu (y) \\&{\mathop {\le }\limits ^{(4.23)}} C_2 \Vert \mu _i\Vert \cdot \Vert \mu _{n_{i,k}}\Vert e^{\frac{(M+1)\,\Vert c\Vert }{\varepsilon }}\int _{Y} \varphi \left( \tfrac{v_{n_{i,k}}^{(\ell -k-1)}(y)}{q_{i,k}\,v_{n_{i,k}}^{(\ell -k)}(y)} \right) u^{(\ell -k)}_{n_{i,k}}(x'') \cdot v^{(\ell -k)}_{n_{i,k}}(y) \, \text {d}\nu (y) \\&{\mathop {\le }\limits ^{(4.24)}} C_2 \Vert \mu _i\Vert \cdot \Vert \mu _{n_{i,k}}\Vert e^{\frac{(M+3)\,\Vert c\Vert }{\varepsilon }}\int _{Y} \varphi \left( \tfrac{v_{n_{i,k}}^{(\ell -k-1)}(y)}{q_{i,k}\,v_{n_{i,k}}^{(\ell -k)}(y)} \right) q_{i,k} \cdot u^{(\ell -k-1)}_{n_{i,k}}(x'') \cdot v^{(\ell -k)}_{n_{i,k}}(y) \, \text {d}\nu (y) \\&{\mathop {\le }\limits ^{(4.19)}} C_2 \Vert \mu _i\Vert e^{\frac{(M+5)\,\Vert c\Vert }{\varepsilon }} {{\,\mathrm{KL}\,}}(\pi ^{(\ell -k-1)}_{n_{i,k}}|\pi ^{(\ell -k)}_{n_{i,k}}) \end{aligned}$$

This eventually yields

$$\begin{aligned} \varDelta (\pi ^{(\ell )})&\le C_1 \cdot C_2 \cdot e^{\frac{(M+5)\,\Vert c\Vert }{\varepsilon }} \sum _{\begin{array}{c} i \in I:\\ D(i)>0 \end{array}} \sum _{k=0}^{D(i)-1} \Vert \mu _i\Vert {{\,\mathrm{KL}\,}}\left( \pi ^{(\ell -k-1)}_{n_{i,k}}|\pi ^{(\ell -k)}_{n_{i,k}}\right) \\&\le C_1 \cdot C_2 \cdot e^{\frac{(M+5)\,\Vert c\Vert }{\varepsilon }} \sum _{\begin{array}{c} j \in I:\\ D(j)<M \end{array}} \sum _{i \in I} \Vert \mu _i\Vert {{\,\mathrm{KL}\,}}\left( \pi ^{(\ell -D(j)-1)}_{j}|\pi ^{(\ell -D(j))}_{j}\right) \\&\le C_1 \cdot C_2 \cdot e^{\frac{(M+5)\,\Vert c\Vert }{\varepsilon }} \sum _{\begin{array}{c} j \in I \end{array}} \sum _{k = 0}^{M-1} {{\,\mathrm{KL}\,}}\left( \pi ^{(\ell -k-1)}_{j}|\pi ^{(\ell -k)}_{j}\right) \\&= C_1 \cdot C_2 \cdot e^{\frac{(M+5)\,\Vert c\Vert }{\varepsilon }} \cdot \left( \varDelta \big (\pi ^{(\ell -M)}\big )-\varDelta \big (\pi ^{(\ell )}\big ) \right) \end{aligned}$$

where we used in the second line that \(D(j)=k\) if \(j=n_{i,k}\) (since \(n_{i,k}\) is the k-th step of the selected shortest path from \(J_0\) to cell i). Since \(C_1 \cdot C_2 \cdot e^{\frac{(M+5)\,\Vert c\Vert }{\varepsilon }} = 2MN\exp ((6M+7)\,\Vert c\Vert /\varepsilon )/\mu _{\text {min}}^{2M+1}\), rearranging yields the sought-after bound (4.13).

Convergence of the iterates to the optimal solution follows as in Theorem 1. \(\square \)

4.3 Numerical worst-case examples

In the preceding Sections we derived bounds on the convergence rate based on various worst-case estimates. But convergence might actually be much faster. In this Section we provide numerical examples that demonstrate that, while the pre-factors in our bounds might not be tight, in the three-cell case the qualitative dependency of the convergence rate on the regularization parameter \(\varepsilon \) and on the mass \(\mu (X_2)\) in the middle cell is accurate; and that convergence slows down in the general case as the maximum distance M from the root cell \(J_0\) in the partition graph increases.

Fig. 4

Setup for worst-case examples. Top row: three cells, Example 2, cost c, initial coupling \(\pi ^{(0)}\) and embedding of \(X=Y=\{1,2,3\}\) into \(\mathbb {R}^2\) such that c becomes the distance function. The domain decomposition algorithm must exchange the masses of the first and third column of \(\pi ^{(0)}\). Bottom row: cost c and initial coupling \(\pi ^{(0)}\) for the chain setup, Example 4, for \(N=5\)

Fig. 5

Results for \(\varepsilon \)-dependency in three-cell worst-case example, see Example 2. Left: Sub-optimality \(\varDelta (\pi ^{(\ell )})\) over iterations for various values of \(\varepsilon \). The values converge to zero linearly. Right: We extract the linear contraction factor \(\lambda (\varepsilon )\) from the left plot and compare it to the theoretical bound from Theorem 1. This is visualized best in the form \(-\log (\lambda (\varepsilon )^{-1}-1)\) for which the bound yields \(\tfrac{2\Vert c\Vert }{\varepsilon }-\log (q/(1-q))\), see (4.25). In particular, the \(1/\varepsilon \)-behaviour is accurate and the pre-factor appears to be off only by a factor of \(\approx 2\)

Example 2

(Three cells, dependency on \(\varepsilon \)) Let \(X=Y=\{1,2,3\}\) with cost c as given in Fig. 4. This configuration can be obtained with embeddings of X and Y into \(\mathbb {R}^2\). Set \(\mu =\nu =p \cdot \delta _{1} + q \cdot \delta _2 + p \cdot \delta _3\) for \(q \in (0,1)\), \(p=\tfrac{1-q}{2}\). We choose \(q=0.3\). Basic and composite partitions are chosen as in the previous three-cell example (Example 1): \(X_i=\{i\}\) for \(i \in I=\{1,2,3\}\), \(\mathcal {J}_A=\{\{1,2\},\{3\}\}\), \(\mathcal {J}_B=\{\{1\},\{2,3\}\}\).

The unique optimal coupling in the unregularized case is given by \(\pi ^*=p \cdot \delta _{(1,1)} + q \cdot \delta _{(2,2)} + p \cdot \delta _{(3,3)}\). As initial coupling for the algorithm we pick \(\pi ^{(0)} :=p \cdot \delta _{(1,3)} + q \cdot \delta _{(2,2)} + p \cdot \delta _{(3,1)}\). So the mass in cell \(X_1\) must be moved into cell \(X_3\) and vice versa.

In the unregularized case \(\pi ^{(0)}\) would be a fixed-point of the iterations, since \(\pi ^{(0)}\) is partially optimal on \(X_{\{1,2\}}\) and \(X_{\{2,3\}}\). This provides a simple example showing that the unregularized domain decomposition algorithm requires additional assumptions for convergence to the global solution.

In the regularized setting, some mass will also be put onto the points (1, 2), (2, 1), (2, 3), and (3, 2) despite the high cost, thus making it possible to move mass from \(X_1\) to \(X_3\) via \(X_2\) and thus eventually to solve the problem. But this mass tends to zero exponentially as \(\varepsilon \rightarrow 0\), resulting in a number of iterations that grows exponentially in \(1/\varepsilon \).

Numerical results are summarized in Figure 5. To estimate \(\varDelta (\pi ^{(\ell )})\) we obtain a high precision approximate solution \(\pi ^*\) by applying a single Sinkhorn algorithm to the problem. Since this can directly exchange mass between cells 1 and 3, for the single solver this problem is not particularly difficult (the same applies to the other examples in this Section). Let \(\lambda (\varepsilon )\) be the empirical contraction factor for the sub-optimality. Theorem 1 yields the bound

$$\begin{aligned} \lambda (\varepsilon ) \le \left( 1+\exp (-2\Vert c\Vert /\varepsilon ) \cdot \tfrac{q}{1-q} \right) ^{-1} \,\, \Leftrightarrow \,\, -\log (\lambda (\varepsilon )^{-1}-1) \le \tfrac{2\Vert c\Vert }{\varepsilon } - \log (q/(1-q)). \end{aligned}$$
(4.25)

As indicated in Fig. 5, the \(1/\varepsilon \)-dependency accurately describes the convergence behaviour.

Fig. 6

Results for q-dependency in three-cell worst-case example, see Example 3. Left: Sub-optimality \(\varDelta (\pi ^{(\ell )})\) over iterations for various values of q. The values converge to zero linearly. Right: We extract the linear contraction factor \(\lambda (q)\) from the left plot and compare it to the theoretical bound from Theorem 1. This is visualized best in the form \(\lambda (q)^{-1}-1\) for which the bound yields \(\exp (-2\Vert c\Vert /\varepsilon ) \cdot \frac{q}{1-q}\), see (4.26). In particular, the proportionality to q (for small values) is accurate. The pre-factor appears to be off by a factor of \(\approx 50\)

Example 3

(Three cells, dependency on q) We revisit the previous Example 2, but now we fix the regularization parameter \(\varepsilon =10\) and vary the mass q in the middle cell. Intuitively, far from the optimal solution, the mass exchanged between the cells in each iteration should be proportional to q and thus for small q the number of required iterations should be approximately proportional to \(q^{-1}\). This would imply that for small q the contraction ratio \(\lambda (q)\) scales as \(\exp (-C\,q) \approx 1-C\,q\) for some constant C. For small q this matches the behaviour of the bound of Theorem 1, which implies, cf. (4.25),

$$\begin{aligned} \lambda (q)^{-1}-1 \ge \exp (-2\Vert c\Vert /\varepsilon ) \cdot \frac{q}{1-q}. \end{aligned}$$
(4.26)

Numerical results are summarized in Figure 6. In our setting the proportionality between \(\lambda (q)^{-1}-1\) and q is confirmed numerically. Note that we chose \(\varepsilon =10\) relatively high. For smaller values we observed that the mass exchanged between the cells in each iteration was no longer proportional to q and thus the interplay between the parameters \(\varepsilon \) and q became more complicated.

Fig. 7

Results for N-dependency in chain graph worst-case example, see Example 4. Left: Sub-optimality \(\varDelta (\pi ^{(\ell )})\) over iterations for various values of N. The values converge to zero linearly. Right: Linear contraction factors \(\lambda (N)\) extracted from the left plot. Since they approach 1 very quickly, they are visualized best in the form \(1-\lambda (N)\) in log-scale

Example 4

(Chain graph, dependency on graph size) The dependency of the contraction ratio bound of Corollary 2 on the graph structure is more complex. The graph structure determines the numbers M, N and implicitly \(\mu _{\text {min}}\) (since mass must be distributed over increasingly many cells). Due to the variety of worst-case estimates that entered into the proof of (4.13) we do not expect it to be particularly tight. Nevertheless, it is easy to confirm numerically that convergence slows down when the graph size increases.

We extend Example 2. For \(N \ge 3\) let \(X:=Y:=I:=\{1,\ldots ,N\}\), \(X_i:=\{i\}\) for \(i \in I\). Let \(\mu :=\nu :=\tfrac{1}{N} \sum _{i \in I} \delta _i\). Set

$$\begin{aligned} c(i,j) :={\left\{ \begin{array}{ll} 0 &{}\quad \text {if } i=j, \\ 1 &{}\quad \text {if } (i,j) \in \{(1,N),(N,1)\}, \\ 10 &{}\quad \text {else.} \end{array}\right. } \end{aligned}$$

The unique optimal coupling is given by \(\pi ^*=\tfrac{1}{N} \sum _{i=1}^N \delta _{(i,i)}\); we choose \(\pi ^{(0)} = \tfrac{1}{N} \big (\sum _{i=2}^{N-1} \delta _{(i,i)} + \delta _{(1,N)} + \delta _{(N,1)}\big )\). The cost c and \(\pi ^{(0)}\) are illustrated in Fig. 4. For N even, we set

$$\begin{aligned} \mathcal {J}_A:=\{\{1,2\},\{3,4\},\ldots ,\{N-1,N\}\},\quad \mathcal {J}_B:=\{\{1\},\{2,3\},\ldots ,\{N-2,N-1\},\{N\}\}, \end{aligned}$$

for N odd we adapt this in the obvious fashion. Finally, let \(\varepsilon =1.4\) (but the qualitative results do not depend on \(\varepsilon \)). Similar to Example 2, mass from the first and last cell must be exchanged, but this time via an increasing number of intermediate cells. Numerical results for this setting are shown in Fig. 7. We find that \(\lambda (N)\) approaches 1 as N increases.

5 Relation to parallel sorting

The linear rates obtained in Theorems 1 and 2 become very slow for small \(\varepsilon \). The number of iterations required to achieve a prescribed sub-optimality scales like \(\exp (C/\varepsilon )\) as \(\varepsilon \) approaches zero. This rate is comparable to the result by Franklin and Lorenz [10] on the convergence of the discrete Sinkhorn algorithm. They show that the intermediate marginal \(\mu ^{(\ell )} = P_X \pi ^{(\ell )}\) in the Sinkhorn algorithm satisfies \(d(\mu ^{(\ell )},\mu ) \le q^\ell \cdot d(\mu ^{(0)},\mu )\) where d denotes Hilbert’s projective metric and \(q \approx 1-4\exp (-\Vert c\Vert /\varepsilon )\) for \(\varepsilon \ll \Vert c\Vert \).

While we do not claim that the bounds in Theorems 1 and 2 are tight, they at least qualitatively capture the right scaling behaviour with respect to \(\varepsilon \), the mass contained in the basic cells, and the (approximate) graph diameter M, as demonstrated with some worst-case examples in Sect. 4.3.

We emphasize however that in Sect. 4 we make virtually no assumptions on the cost function (beyond boundedness) and on the structure of the basic and composite partitions (beyond basic cells carrying non-zero mass and a connected partition graph). This freedom is essential for the construction of the examples in Sect. 4.3. Conversely, in practice we are usually not dealing with such adversarial problems, but often need to solve transport problems on low-dimensional domains, with highly structured cost functions, and it is possible to choose highly structured basic and composite partitions. Based on the intuition from Fig. 1 we expect that the algorithm will perform much better in such settings, which is confirmed by numerical experiments in Sect. 7.

Unfortunately, a convergence analysis that leverages the additional geometric structure to obtain better rates is much more challenging than the worst-case bounds from Sect. 4 and thus beyond the scope of this article.

To provide at least some insight we discuss in this Section an example of discrete, one-dimensional, unregularized optimal transport where the domain decomposition algorithm becomes equivalent to the parallel odd-even transposition sorting algorithm. Let

$$\begin{aligned} X&=\{x_1,\ldots ,x_{N}\},&Y&=\{y_1,\ldots ,y_{N}\}, \end{aligned}$$

for some \(N \in \mathbb {N}\). Let \(\mu \) and \(\nu \) be the normalized counting measures on X and Y, which can be written as

$$\begin{aligned} \mu&= \frac{1}{N} \sum _{i=1}^{N} \delta _{x_i},&\nu&= \frac{1}{N} \sum _{i=1}^{N} \delta _{y_i}. \end{aligned}$$

and let \(c : X \times Y \rightarrow \mathbb {R}_+\) satisfy the strict Monge property [5], i.e. 

$$\begin{aligned} c(x_i,y_j)+c(x_r,y_s)< c(x_i,y_s) + c(x_r,y_j) \qquad \text {for} \qquad 1 \le i< r \le N,\,1 \le j < s \le N. \end{aligned}$$
(5.1)

A simple example for such a setting would be when X and Y are (sorted) sets of points in \(\mathbb {R}\), \(x_i < x_j\), \(y_i < y_j\) for \(1 \le i < j \le N\), and \(c(x_i,y_j)=h(x_i-y_j)\) for a strictly convex function h. This covers the Wasserstein distances for \(p>1\).

For this setting we are now interested in the unregularized optimal transport problem

$$\begin{aligned} {\min \left\{ \int _{X \times Y} c\,\text {d}\pi \big \vert \pi \in \Pi (\mu ,\nu ) \right\} .} \end{aligned}$$
(5.2)

It is easy to see that this is equivalent to solving the linear assignment problem between the sets X and Y with respect to the cost c and that the unique optimal solution is given by \(\pi =\tfrac{1}{N} \sum _{i=1}^N \delta _{(x_i,y_i)}\) (uniqueness is implied by strict inequality in (5.1)).

Let now the basic partition of X be given by the partition into singletons, i.e. \(X_i=\{x_i\}\) for \(i \in I = \{1,\ldots ,N\}\), and set the two composite partitions \(\mathcal {J}_A\) and \(\mathcal {J}_B\) to

$$\begin{aligned} \mathcal {J}_A&= (\{1,2\},\{3,4\},\ldots ,\{N-1,N\}),&\mathcal {J}_B&= (\{1\},\{2,3\},\ldots ,\{N-2,N-1\},\{N\}) \end{aligned}$$
(5.3)

where for simplicity of exposition we assume that N is even. Further, let \(\pi ^{(0)}=\tfrac{1}{N} \sum _{i=1}^N \delta _{(x_i,y_{\sigma (i)})}\) where \(\sigma : \{1,\ldots ,N\} \rightarrow \{1,\ldots ,N\}\) is some permutation. We can now consider running an unregularized version of Algorithm 1 where we replace the regularized transport solver in line 8 by the unregularized counterpart. It can quickly be seen that (due to the strict Monge property) the solution of the (now unregularized) sub-problems in line 8 is always unique and at each iteration \(\pi ^{(\ell )}\) corresponds to a permutation \(\sigma ^{(\ell )}\). Each composite cell (with the exception of the cells \(\{1\}\) and \(\{N\}\) in \(\mathcal {J}_B\)) consists of two consecutive elements of X and in line 8 it is tested whether the two corresponding assigned points in Y should be kept in the current order or flipped. Therefore, in this special case the domain decomposition algorithm reduces to the odd-even transposition parallel sorting algorithm [14, Exercise 37], which is known to converge in O(N) iterations. It is not hard to see that the analysis for odd-even transposition sort could be adapted to the continuous one-dimensional strictly-Monge case to establish convergence in \(O(1/\mu _{\text {min}})\) steps.
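
To make this correspondence concrete, the following minimal Python sketch (our illustration; the function name and the encoding of \(\pi ^{(\ell )}\) as a permutation array are not taken from the paper) runs the unregularized domain decomposition on the singleton basic cells with the composite partitions (5.3): by the strict Monge property (5.1) each two-element cell problem reduces to a compare-and-swap of the two currently assigned Y-indices, so alternating \(\mathcal {J}_A\)- and \(\mathcal {J}_B\)-sweeps is precisely odd-even transposition sort.

```python
def domain_decomposition_sort(sigma):
    """Unregularized domain decomposition in the setting of Sect. 5.

    sigma[i] is the index of the y-point currently assigned to x_i, i.e. the
    permutation encoding pi^(l) (0-based indices, x and y sorted increasingly).
    On a two-element composite cell {i, i+1} the strict Monge property forces
    the monotone assignment, i.e. a compare-and-swap of the two y-indices.
    """
    N = len(sigma)
    for sweep in range(N):                   # O(N) sweeps suffice to sort
        offset = sweep % 2                   # even sweeps: J_A-cells, odd sweeps: J_B-cells
        for i in range(offset, N - 1, 2):    # composite cell {i, i+1}
            if sigma[i] > sigma[i + 1]:
                sigma[i], sigma[i + 1] = sigma[i + 1], sigma[i]
    return sigma
```

After at most N sweeps the permutation is sorted, i.e. the iterate has reached the unique optimal coupling.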

Unfortunately, the techniques do not generalize to multiple dimensions and an accurate analysis of the convergence speed of the domain decomposition algorithm in that setting is therefore still an open problem. Nevertheless, we are optimistic that the expected performance for ‘structured problems’ in higher dimensions will be much better than the worst-case bounds from Sect. 4, which is confirmed numerically in Sect. 7.

6 Practical large-scale implementation

Algorithm 2

6.1 Computational adaptations

Our strongest motivation for studying the entropic domain decomposition algorithm is to develop an efficient numerical method. The statement given in Algorithm 1 is mathematically abstract and served as preparation for theoretical analysis. In this Section we describe a more concrete computational implementation, formalized in Algorithm 2. We summarize the main modifications:

  1. (i)

    Instead of storing the full coupling \(\pi ^{(\ell )}\), we only store the Y-marginal \(\nu ^{(\ell )}_i :=\text {P}_Y \pi ^{(\ell )}_i\) of each basic cell \(i \in I\). This suffices to compute the Y-marginals of \(\pi ^{(\ell )}\) on each composite cell, as done in Algorithm 1, line 7. The partial couplings \(\pi ^{(\ell )}_i\) can quickly be recovered with the scaling factors u, see (iv) below.

  2. (ii)

    To further reduce the memory footprint, we truncate \(\nu ^{(\ell )}_i\) after each iteration, i.e. we only keep a sparse representation of the entries above a small threshold, which we set to \(10^{-15}\); a minimal sketch of this truncation step is given after this list. In theory this invalidates the global density bound (4.15) in Lemma 6 and thus jeopardizes convergence of the method. In practice our threshold is low enough such that the \(\nu ^{(\ell )}_i\) on neighbouring basic cells have strong overlap and thus, based on the intuition from Sect. 5, on geometric problems we still expect convergence to an approximate solution. This can be confirmed numerically via the primal-dual gap (Sect. 7.3 and Table 1).

  3. (iii)

    We replace the ‘abstract exact solution’ of the entropic transport problems on the composite cells by the Sinkhorn algorithm (compare Algorithm 1, line 8 and Algorithm 2, line 10). This means we need to be able to handle approximate solutions of the cell problems. We describe how this is done in Sect. 6.2.

  4. (iv)

    We explicitly keep track of the partial scaling factors u on the composite cells of \(\mathcal {J}_A\) and \(\mathcal {J}_B\). This allows accurate initialization of the Sinkhorn algorithm on each composite cell, in particular after the first few iterations. Further, it enables us to quickly reconstruct the partial couplings on composite and basic cells. Finally, we can use them to estimate the sub-optimality via the primal-dual gap, see Sect. 6.3.

  5. (v)

    We combine Algorithm 2 with a multi-scale scheme and the \(\varepsilon \)-scaling heuristic described in [23]. This is critical for computational efficiency, as it allows us to obtain a good (and sparse) initial coupling \(\pi ^0\) (or its partial Y-marginals on the basic cells) and drastically reduces the number of required iterations. Some more details are given in Sect. 6.4.
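
As mentioned in (ii), truncating a partial marginal to a sparse representation is elementary; a minimal sketch (dense input, index–value output, names are ours) could look as follows.

```python
import numpy as np

def truncate_marginal(nu_i, threshold=1e-15):
    """Sparse (support indices, values) representation of a partial marginal,
    dropping all entries below the truncation threshold."""
    nu_i = np.asarray(nu_i, dtype=float)
    support = np.nonzero(nu_i >= threshold)[0]
    return support, nu_i[support]
```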

Table 1 Summary of average performance (with standard deviation) of the domain decomposition algorithm and a single sparse Sinkhorn algorithm [23] for comparison

6.2 Sinkhorn solver and handling approximate cell solutions

The Sinkhorn solver at line 10 in Algorithm 2 only returns an approximate solution. This requires some additional post-processing. Note that for efficiency, we initialize the solver with the scaling factor \(u_{C,J}\) from the previous iteration, i.e. we start the solver with a Y-iteration (Remark 1).

As stopping criterion of the Sinkhorn solver we use the \(L^1\)-marginal error of the X-marginal after the Y-iteration. We require that this error is smaller than \(\Vert \mu _J\Vert \cdot \text {Err}\) where \(\text {Err}\) is some global error threshold. In this way, the summed \(L^1\)-error of all X-marginals over all composite cells is bounded by \(\text {Err}\) after each iteration.

We terminate the Sinkhorn solver after a Y-iteration, i.e. the Y-marginals are satisfied exactly, \(\text {P}_Y \pi _\text {cell}=\nu _\text {cell}\), and after the for-loop in line 14 one has \(\sum _{i \in I} \nu _i = \nu \), which is crucial for the validity of the domain decomposition scheme. Since \(\mu _J = \mu {{\llcorner }}X_J\) is constant throughout the algorithm it is easier to handle a deviation between \(\text {P}_X \pi _\text {cell}\) and \(\mu _J\).

However, when \(\pi _\text {cell}\) is not contained in \(\Pi (\mu _J,\nu _\text {cell})\), after computing the basic cell marginals in line 11 it may occur that \(\Vert \nu _i\Vert \ne \Vert \mu _i\Vert \). This means that, while \(\sum _{i \in J} \nu _i = \nu _\text {cell}\), the masses of the \(\nu _i\) may not exactly match the masses of their basic cells. This would cause problems in subsequent iterations. Therefore, the function \(\textsc {BalanceMeasures}\) determines for each basic cell \(i \in J\) the deviation \(\Vert \nu _i\Vert - \Vert \mu _i\Vert \) and then moves mass between the different \((\nu _i)_{i \in J}\) to compensate the deviations, while preserving their non-negativity and their sum. This can be done efficiently directly in the sparse data structure and without adding new support points.
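
For illustration, one possible greedy realization of \(\textsc {BalanceMeasures}\) is sketched below (our own sketch on dense arrays; the actual implementation operates on the sparse representation and avoids creating new support points): cells with surplus mass give mass away at their own support points to cells with a deficit, which preserves non-negativity and the sum \(\sum _{i \in J} \nu _i\).

```python
import numpy as np

def balance_measures(nu_rows, target_masses):
    """Redistribute mass among the partial marginals of one composite cell.

    nu_rows       : array of shape (num_basic_cells, num_y); row i is nu_i
    target_masses : prescribed total masses ||mu_i||, summing to nu_rows.sum()
    Returns rows with the prescribed total masses, non-negative entries and
    unchanged column sums (i.e. sum_i nu_i is preserved).
    """
    nu = np.array(nu_rows, dtype=float)
    deviation = nu.sum(axis=1) - np.asarray(target_masses, dtype=float)
    for i in np.where(deviation > 0)[0]:            # cells with surplus mass
        for j in np.where(deviation < 0)[0]:        # cells with a mass deficit
            amount = min(deviation[i], -deviation[j])
            if amount <= 0:
                continue
            share = nu[i] * (amount / nu[i].sum())  # remove mass proportionally from nu_i
            nu[i] -= share
            # add it to nu_j at the same y-locations; unlike the paper's routine,
            # this simple sketch may enlarge the support of nu_j
            nu[j] += share
            deviation[i] -= amount
            deviation[j] += amount
    return nu
```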

6.3 Estimating the primal-dual gap

Explicitly evaluating the primal-dual gap between (2.4) and (2.8) (see Proposition 2 (ii)) is a useful tool to verify that a high-quality approximate solution was found. As primal candidate we can use the current primal iterate. We can sum up the contribution of each \(\pi _\text {cell}\) to (2.4) in the for-loop starting at line 8. The partial replacement with older iterates, as in Lemma 5, was merely required for the proof strategy; for the numerical evaluation it is not necessary.

As a dual candidate we could use \({\hat{u}}\) and \({\hat{v}}\) as constructed in (4.16, 4.17). It is not hard to see that upon perfect convergence of the domain decomposition algorithm, the functions \(u^{(\ell )}_{J_A}\) and \(u^{(\ell -1)}_{J_B}\) on two overlapping composite cells \(J_A \in \mathcal {J}_A\), \(J_B \in \mathcal {J}_B\) would only differ by a constant factor on their overlap \(J_A \cap J_B \ne \emptyset \). Consequently, the \(({\hat{u}},{\hat{v}})\) ‘glued’ together in this way would be a dual maximizer (cf. proof strategy of Proposition 4). In practice, before perfect convergence, we find, however, that this construction yields unnecessarily bad dual estimates. We now sketch a slightly more sophisticated construction (along with an explanation of why the above construction does not work well numerically).

Inspired by the partition graph, Definition 7, and the construction (4.16, 4.17) we now define another weighted graph. The vertex set is given by the composite cells \(\mathcal {J}_A\). Between two vertices \(J_1, J_2 \in \mathcal {J}_A\) there will be an edge if there exists some \(J_B \in \mathcal {J}_B\) such that \(J_1 \cap J_B \ne \emptyset \) and \(J_2 \cap J_B \ne \emptyset \). The corresponding edge-weight will be set to

$$\begin{aligned} q_{(J_1,J_2)} :=\fint _{J_1 \cap J_B} \frac{u_{A,J_1}}{u_{B,J_B}}\,\text {d}\mu \cdot \fint _{J_2 \cap J_B} \frac{u_{B,J_B}}{u_{A,J_2}}\,\text {d}\mu . \end{aligned}$$

Note that the weight depends on the orientation of the edge and that there may be multiple edges between the same two vertices. Similar to (4.16, 4.17) we could now fix a root cell \(J_0 \in \mathcal {J}_A\) and for any \(J \in \mathcal {J}_A\) we fix some path \((J_0,J_1,\ldots ,J_n=J)\) and set

$$\begin{aligned} {\hat{u}}_{J} :=\prod _{k=0}^{n-1} q_{(J_k,J_{k+1})} \cdot u_{A,J}. \end{aligned}$$

Again, after perfect convergence, the \({\hat{u}}\) obtained by gluing the \({\hat{u}}_J\) would be a dual maximal X-scaling. The value of \({\hat{u}}_J\) would not depend on the choice of the path from \(J_0\) to J, and when going around a cycle in the graph the product of the corresponding weight factors would be equal to 1. This means that \(\log q_{(J_1,J_2)}\) is a discrete gradient on the graph, i.e. summing \(\log q_{(J_1,J_2)}\) around a cycle yields zero.

But before perfect convergence, the weights are inconsistent in the sense that \(\log q_{(J_1,J_2)}\) is not a gradient and by picking some arbitrary path from \(J_0\) to J we distribute this inconsistency in an arbitrary way over \({\hat{u}}_J\). In practice this leads to poor candidates \({\hat{u}}\), in particular far from the root cell.

We weaken this effect by doing a discrete Helmholtz decomposition of \(\log q_{(J_1,J_2)}\). More concretely, we look for a minimizer of

$$\begin{aligned} {\min \left\{ \sum _{\begin{array}{c} J_1,J_2 \in \mathcal {J}_A:\\ (J_1,J_2) \text { is graph edge} \end{array}} \left( (V(J_1)-V(J_2)) - \log q_{(J_1,J_2)} \right) ^2 \big \vert V : \mathcal {J}_A\rightarrow \mathbb {R}\right\} .} \end{aligned}$$

The gradient of the minimizing potential V will be an optimal approximation of the edge weights in a mean-square sense. This problem can be solved quickly via a linear system, even when there are multiple edges between two vertices. We then set \({\hat{u}}_J :=\exp (V_J) \cdot u_{A,J}\). We observe that this distributes the inconsistencies in the weights more evenly over the graph and yields better dual estimates.
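
A minimal sketch of this least-squares step (assuming the directed edge list with weights \(\log q_{(J_1,J_2)}\) has already been assembled; we call a generic least-squares solver, while one may equally solve the corresponding normal equations, which form a graph Laplacian system):

```python
import numpy as np

def fit_potential(num_cells, edges):
    """Potential V on the composite cells of J_A minimizing
    sum over edges of ((V[a] - V[b]) - log_q)^2.

    edges : list of (a, b, log_q), where a, b index composite cells and log_q
            is the weight of the directed edge (J_a, J_b); parallel edges
            simply contribute several rows.
    V is only determined up to an additive constant, which is irrelevant for
    the rescaling u_hat_J = exp(V_J) * u_{A,J}; lstsq picks the minimum-norm one.
    """
    A = np.zeros((len(edges), num_cells))
    rhs = np.zeros(len(edges))
    for row, (a, b, log_q) in enumerate(edges):
        A[row, a] = 1.0
        A[row, b] = -1.0
        rhs[row] = log_q
    V, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return V
```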

Since we truncate the partial Y-marginals (see Sect. 6.1 (ii)), unlike in the theoretical analysis the factors \(v_J\) are not defined on all of Y. But once the potential V is available, the \(v_J\) can be ‘glued’ together similarly to \({\hat{u}}\).

In practice this method yields small primal-dual gaps (cf. Sect. 7.3), confirming that the entropic domain decomposition works well.

6.4 Multi-scale scheme and \(\varepsilon \)-scaling

In [23] it is described in detail how a combination of a multi-scale scheme and \(\varepsilon \)-scaling increases the computational efficiency of the Sinkhorn algorithm and how to implement them numerically. Similarly, the two techniques are crucial for computational efficiency of the domain decomposition scheme. For instance, to initialize Algorithms 1 and 2 one requires a feasible coupling \(\pi ^0 \in \Pi (\mu ,\nu )\) (or its partial Y-marginals on basic cells). Without further prior knowledge, the only canonical candidate would be the product measure \(\mu \otimes \nu \). But this is almost certainly far from the optimal coupling, hence requiring many iterations, and it is dense, i.e. requiring a lot of memory in the initial iterations. Instead, as usual, we start by solving the problem on a coarse scale and then refine the (approximate) solutions to subsequent finer scales for initialization.

We now provide a few more details on how this is done. Assume X and Y are discrete, finite sets. Let \({\hat{X}}\) and \({\hat{Y}}\) be coarse approximations of X and Y, and let \(\text {pa}: X \sqcup Y \rightarrow {\hat{X}} \sqcup {\hat{Y}}\) be a function that assigns to every element of X and Y its parent element in \({\hat{X}}\) and \({\hat{Y}}\). Coarse approximations of \(\mu \) and \(\nu \) are then given by \({\hat{\mu }} :=\text {pa}_\sharp \mu \) and \({\hat{\nu }} :=\text {pa}_\sharp \nu \) where \(\text {pa}_\sharp \) denotes the push-forward of measures. Let \(\{{\hat{X}}_i\}_{i \in {\hat{I}}}\) be a basic partition of \(({\hat{X}},{\hat{\mu }})\) and let \(\{X_i\}_{i \in I}\) be a basic partition of \((X,\mu )\). We choose them in such a way that the fine basic partition can be interpreted as a refinement of the coarse basic partition, i.e. for each \(i \in I\) there is some unique \(j \in {\hat{I}}\) such that \(\text {pa}(x) \in {\hat{X}}_{j}\) for all \(x \in X_i\). We denote by \(\text {Pa}\) the function \(I \rightarrow {\hat{I}}\) taking i to this j.

Let now \(({\hat{\nu }}_j)_{j \in {\hat{I}}}\) be the tuple of partial marginals as returned by Algorithm 2 on the coarse scale. We use as initialization on the subsequent finer scale the measures defined via

$$\begin{aligned} \nu _i^0(y) :=\nu (y) \cdot \frac{{\hat{\nu }}_{\text {Pa}(i)}(\text {pa}(y))}{{\hat{\nu }}(\text {pa}(y))} \cdot \frac{\mu (X_i)}{{\hat{\mu }}({\hat{X}}_{\text {Pa}(i)})} \end{aligned}$$
(6.1)

where by a slight abuse of notation, we write \(\nu (y)\) to mean \(\nu (\{y\})\) for singletons.
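
In code, (6.1) is a simple element-wise operation once the parent maps are available. The following sketch (dense arrays and our own variable names; the implementation stores the marginals sparsely) computes one refined partial marginal \(\nu _i^0\).

```python
import numpy as np

def refine_partial_marginal(i, nu, nu_hat, nu_hat_partial, pa_y, Pa, mu_cell, mu_hat_cell):
    """Initialization (6.1) of the fine partial marginal nu_i^0.

    nu             : fine Y-marginal, shape (num_fine_y,)
    nu_hat         : coarse Y-marginal, shape (num_coarse_y,)
    nu_hat_partial : coarse partial marginals, shape (num_coarse_cells, num_coarse_y)
    pa_y           : integer array mapping each fine y to its coarse parent
    Pa             : integer array mapping each fine basic cell to its coarse parent cell
    mu_cell        : masses mu(X_i) of the fine basic cells
    mu_hat_cell    : masses mu_hat(X_hat_j) of the coarse basic cells
    """
    j = Pa[i]
    ratio = nu_hat_partial[j, pa_y] / nu_hat[pa_y]     # nu_hat_j(pa(y)) / nu_hat(pa(y))
    return nu * ratio * (mu_cell[i] / mu_hat_cell[j])  # element-wise in y
```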

Proposition 5

(6.1) provides a valid initialization for Algorithm 2.

Proof

We need to show that \(\sum _{i \in I} \nu ^0_i=\nu \) and \(\Vert \nu ^0_i\Vert =\Vert \mu _i\Vert \) for all \(i \in I\). For the former, for \(y \in Y\) we find

$$\begin{aligned} \sum _{i \in I} \nu _i^0(y)&= \sum _{j \in {\hat{I}}} \sum _{\begin{array}{c} i \in I:\\ \text {Pa}(i)=j \end{array}} \nu _i^0(y) = \nu (y) \sum _{j \in {\hat{I}}} \frac{{\hat{\nu }}_j(\text {pa}(y))}{{\hat{\nu }}(\text {pa}(y))} \sum _{\begin{array}{c} i \in I:\\ \text {Pa}(i)=j \end{array}} \frac{\mu (X_i)}{{\hat{\mu }}({\hat{X}}_j)}=\nu (y) \end{aligned}$$

where in the third expression we first see that the third factor equals 1, since \(\text {pa}^{-1}({\hat{X}}_j)=\cup _{i \in I:\text {Pa}(i)=j} X_i\), and subsequently the second factor equals 1, since \(\sum _{j \in {\hat{I}}} {\hat{\nu }}_j={\hat{\nu }}\), which is implied by the domain decomposition algorithm on the coarse scale.

Similarly, for the latter, for \(i \in I\),

$$\begin{aligned} \Vert \nu _i^0\Vert&=\sum _{{\hat{y}} \in {\hat{Y}}} \sum _{\begin{array}{c} y \in Y:\\ \text {pa}(y)={\hat{y}} \end{array}} \nu _i^0(y) = \frac{\mu (X_i)}{{\hat{\mu }}({\hat{X}}_{\text {Pa}(i)})} \sum _{{\hat{y}} \in {\hat{Y}}} \frac{{\hat{\nu }}_{\text {Pa}(i)}({\hat{y}})}{{\hat{\nu }}({\hat{y}})} \underbrace{\sum _{\begin{array}{c} y \in Y:\\ \text {pa}(y)={\hat{y}} \end{array}} \nu (y)}_{={\hat{\nu }}({\hat{y}})} \\&= \frac{\mu (X_i)}{{\hat{\mu }}({\hat{X}}_{\text {Pa}(i)})} \sum _{{\hat{y}} \in {\hat{Y}}} {\hat{\nu }}_{\text {Pa}(i)}({\hat{y}}) = \frac{\mu (X_i)}{{\hat{\mu }}({\hat{X}}_{\text {Pa}(i)})} {\hat{\nu }}_{\text {Pa}(i)}({\hat{Y}})=\mu (X_i) \end{aligned}$$

where we have used the consistency condition \(\Vert {\hat{\nu }}_j\Vert =\Vert {\hat{\mu }}_j\Vert ={\hat{\mu }}({\hat{X}}_j)\) for \(j \in {\hat{I}}\) in the coarse domain decomposition algorithm. \(\square \)

Of course this scheme can be repeated over multiple successive levels of approximations.

In addition, we refine the scaling factors u from the coarse to the fine level to obtain good initializations. For this, we first ‘glue’ together the final u at the coarse level, as described in Sect. 6.3. Then we do a linear interpolation onto the fine points (which we perform in \(\log \)-space).

Finally, we remark that when changing the parameter \(\varepsilon \) during \(\varepsilon \)-scaling from \(\varepsilon _{\text {old}}\) to \(\varepsilon _{\text {new}}\), the scalings (uv) should not be left constant, but instead \((\varepsilon \cdot \log (u), \varepsilon \cdot \log (v))\) should be kept constant, so we set \(u_{\text {new}} :=\exp (\tfrac{\varepsilon _{\text {old}}}{\varepsilon _{\text {new}}} \cdot \log u_{\text {old}})\), and similarly for \(v_\text {new}\), cf. [23, Section 3.2].
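
A one-line sketch of this rescaling (names are ours), which keeps the dual potentials \(\varepsilon \log u\) and \(\varepsilon \log v\) fixed:

```python
import numpy as np

def rescale_scalings(u_old, v_old, eps_old, eps_new):
    # u_new = exp((eps_old / eps_new) * log(u_old)), i.e. a power of the old scaling
    factor = eps_old / eps_new
    return np.power(u_old, factor), np.power(v_old, factor)
```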

7 Numerical geometric large scale examples

We now demonstrate the efficiency of Algorithm 2 on large-scale geometric problems and show that it compares favorably to the single Sinkhorn algorithm.

7.1 Preliminaries

Hardware and implementation All numerical experiments were performed on a standard desktop computer with an Intel Core i7-8700 CPU with 6 physical cores and 32 GB of RAM, running a standard Ubuntu-based distribution. We developed an experimental implementation for Python 3.7. Parallelization was implemented with MPI, via the module mpi4py. We used a simple master/worker setup with one thread (master) keeping track of the overall problem and submitting the sub-tasks to the other threads (workers). Parallelization was used for solving the composite cell problems, the BalanceMeasures function and the data refinement from coarse to fine layers.

Stopping criterion and measure truncation We set the stopping criterion for the internal Sinkhorn solver to \(\text {Err}=10^{-4}\) (see Sect. 6.2). This bounds the \(L^1\)-marginal error of the primal iterates. Partial marginals \((\nu _i)_{i \in I}\) are truncated at \(10^{-15}\) and stored as sparse vectors.

Test data We focus on solving the Wasserstein-2 optimal transport problem, i.e. we set \(c(x,y)=\Vert x-y\Vert ^2\), but the scheme can be applied to arbitrary costs and we expect efficient performance for any cost of the form \(c(x,y)=h(x-y)\) for strictly convex h, such that (in the continuous, unregularized setting) the unique optimal plan is concentrated on the graph of a Monge map, see [11].

Fig. 8

Two example images of size \(256 \times 256\), at different layers, \(l=8\) being the original. Yellow represents high, blue low mass density. The combination of strong local mass concentrations, smooth regions and areas with very little mass leads to challenging test problems with non-trivial transport maps (color figure online)

As test data we use 2D images with dimensions \(2^n \times 2^n\) for \(n=6\) to \(n=11\), i.e. from \(64 \times 64\) up to \(2048 \times 2048\), the number of pixels per image ranging from 4.1E3 to 4.2E6. Similar to [22, 23] the images were randomly generated, consisting of mixtures of Gaussians with random variances and magnitudes. This represents challenging problem data, leading to non-trivial optimal couplings, involving strong compression, expansion and anisotropies, as visualized in Figures 8 and 11. For each problem size we generated 10 test images, i.e. 45 pairwise non-trivial transport problems, to get some estimate of the average performance.

Multi-scale and \(\varvec{\varepsilon }\)-scaling Each image is then represented by a hierarchy of increasingly finer images of size \(2^l \times 2^l\) where l ranges from 3 to n. We refer to l as the layer. This induces a simple quad-tree structure over the images where each pixel at layer l is the parent of four pixels at layer \(l+1\), see Sect. 6.4 and [22, 23] for more details.

We embed the pixels into \(\mathbb {R}^2\) as a Cartesian grid of edge length \(\varDelta x_n :=1\) between two neighbouring pixels at layer \(l=n\) and with edge length \(\varDelta x_l :=2^{n-l}\) on coarser layers. At each layer l we start solving with \(\varepsilon =2 \cdot \varDelta x_l^2\), for which we apply four iterations (two on \(\mathcal {J}_A\) and two on \(\mathcal {J}_B\)). Then we decrease \(\varepsilon \) by a factor of two and apply two more iterations (one on \(\mathcal {J}_A\) and one on \(\mathcal {J}_B\)). We repeat this (so that \(\varepsilon =0.5 \cdot \varDelta x_l^2\)) and then refine the image. This leaves the scale of \(c(\cdot ,\cdot )/\varepsilon \) approximately invariant throughout the layers (cf. [23]). At the finest layer, we perform two additional iterations at \(\varepsilon =0.25\), which implies that the influence of entropic regularization in our final solution is rather weak (cf. Fig. 10).
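
For concreteness, the layer and \(\varepsilon \) schedule just described can be generated as follows (a sketch with our own naming; the coarsest layer is \(l=3\) as above). Summing the iteration counts over all layers reproduces the total number of iterations stated in the next paragraph.

```python
def iteration_schedule(n, coarsest_layer=3):
    """List of (layer, eps, number of domain decomposition iterations)."""
    schedule = []
    for l in range(coarsest_layer, n + 1):
        dx = 2.0 ** (n - l)                      # grid constant on layer l
        schedule += [(l, 2.0 * dx**2, 4),        # two J_A- and two J_B-iterations
                     (l, 1.0 * dx**2, 2),
                     (l, 0.5 * dx**2, 2)]
    schedule.append((n, 0.25, 2))                # two extra iterations at the finest layer
    return schedule
```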

This means that for two images of size \(2^n \times 2^n\) we only perform \((n-2) \cdot 8 + 2\) iterations of the domain decomposition algorithm, even though there are \(2^{2n-4}\) basic cells in the X-image (see below) and the graph diameter (Definition 7) is on the order \(O(2^{n-2})\). We observe that this logarithmic number of iterations is fully sufficient (cf. Table 1). This is only possible due to the geometric problem structure (Sect. 5) and the multi-scale scheme with \(\varepsilon \)-scaling.

Basic and composite partitions As basic partition we divide each image (at a given layer l) into blocks of \(s \times s\) pixels where the cell size s is a suitable divisor of \(2^l\). The composite partition \(\mathcal {J}_A\) is then generated by grouping the basic cells into \(2 \times 2\) blocks; the partition \(\mathcal {J}_B\) is generated in the same way but with an offset of one basic cell in each direction, i.e. with single basic cells in the corners and \(2 \times 1\) blocks along the image boundaries. This is visualized in Fig. 2. Each basic cell at layer l is therefore split into four basic cells at layer \(l+1\), thus satisfying the assumption from Sect. 6.4.
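
A short sketch of this construction of the two composite partitions over the grid of basic cells (indices and helper names are ours; cf. Fig. 2):

```python
def index_groups(num_blocks, offset):
    """Groups of at most two consecutive indices. offset 0: [0,1],[2,3],...;
    offset 1: [0],[1,2],...,[num_blocks-1] (single cells at the boundary)."""
    groups = [[0]] if offset else []
    i = offset
    while i < num_blocks:
        groups.append(list(range(i, min(i + 2, num_blocks))))
        i += 2
    return groups

def composite_partitions(num_blocks):
    """J_A and J_B as lists of composite cells, each a list of (row, col) basic cells."""
    partitions = []
    for offset in (0, 1):                        # offset 0 gives J_A, offset 1 gives J_B
        cells = []
        for rows in index_groups(num_blocks, offset):
            for cols in index_groups(num_blocks, offset):
                cells.append([(r, c) for r in rows for c in cols])
        partitions.append(cells)
    return partitions                            # [J_A, J_B]
```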

The cell size s is an important parameter. For small s the composite cell problems are small (and thus easier), but one must solve a larger number of composite problems, implying more communication overhead in parallelization and requiring more domain decomposition iterations. For large s the number of composite problems is smaller, thus inducing less communication overhead and requiring fewer iterations, but solving the composite problems becomes more challenging. We found that \(s=4\) yielded the best results for our code. Studying the influence of this parameter in more detail and reducing the computational overhead of the implementation are relevant directions for future work.

7.2 Visualization of primal iterates

To get an impression of the algorithm’s behaviour we visualize the evolution of the primal iterates \(\pi ^{(\ell )}\). This is difficult since in our examples the iterates are measures on \(\mathbb {R}^4\). But the partition structure provides some assistance. We assign colors to the basic cells in an alternating pattern (e.g. a checkerboard tiling). Denote by \(\text {col}_i\) the color assigned to basic cell \(i \in I\) (\(\text {col}_i\) could just be an RGB vector). We can then approximately visualize a coupling \(\pi \in \Pi (\mu ,\nu )\) with partial marginals \(\nu _i :=\text {P}_Y \pi _i\) via the color image on Y generated by \(\sum _{i \in I} \text {col}_i \cdot \tfrac{\text {d}\nu _i}{\text {d}\nu }\). When \(\pi \) is a deterministic coupling, e.g. induced by a Monge map \(T : X \rightarrow Y\), then the region \(T(X_i)\) will be colored with \(\text {col}_i\), i.e. the image would look like a Cartesian grid that was morphed by the map T. When \(\pi \) is non-deterministic, different \(\nu _i\) can overlap and in these regions the resulting color will be a linear interpolation of the colors \(\text {col}_i\), weighted by the \(\nu _i\).
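
A minimal sketch of this visualization (dense arrays, names are ours):

```python
import numpy as np

def colorize_coupling(nu_partial, nu, colors):
    """Color image sum_i col_i * (d nu_i / d nu), cf. the construction above.

    nu_partial : partial Y-marginals nu_i, shape (num_cells, H, W)
    nu         : full Y-marginal, shape (H, W)
    colors     : one RGB triple per basic cell, shape (num_cells, 3)
    """
    weights = nu_partial / np.maximum(nu, 1e-300)   # densities d nu_i / d nu, in [0, 1]
    return np.einsum("ihw,ic->hwc", weights, colors)
```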

Fig. 9

Primal iterates for the images from Fig. 8, visualized as described in Sect. 7.2, on layer \(l=8\), \(\varepsilon =2\), for \(\ell =0,1,2\). The initial coupling is blurry due to the refinement construction via (6.1). For iterates after \(\ell =2\) it is hard to spot further changes

Fig. 10

Primal iterates for the images from Fig. 8, visualized as described in Sect. 7.2, on layer \(l=8\), for \(\varepsilon =2\) (\(\ell =4\) on that layer) and \(\varepsilon =0.25\) (\(\ell =10\)), see Sect. 7.1 for a description of \(\varepsilon \)-scaling and iteration schedule. For \(\varepsilon =2\) some blur between neighbouring cells is visible, for \(\varepsilon =0.25\) it is virtually gone (up to some inevitable overlap due to the discretization)

Fig. 11

Primal iterates for the images from Fig. 8, visualized as described in Sect. 7.2, on layers \(l=3\) to \(l=8\), for \(\varepsilon =0.5 \cdot \varDelta x_l\), where \(\varDelta x_l=2^{l-8}\) is the distance between adjacent grid points on a given layer, and \(\ell =8\) on each layer, except on the finest layer, where \(\varepsilon =0.25\) and \(\ell =10\). See Sect. 7.1 for a description of \(\varepsilon \)-scaling and iteration schedule. During each refinement a basic cell is split into four smaller basic cells and additional details in the optimal coupling become discernible

Consider the optimal transport problem between the two images from Fig. 8 with size \(256 \times 256\). Various aspects of this problem are illustrated in Figs. 9, 10 and 11.

Figure 9 shows the effect of the first two iterations on the finest layer \(l=8\). The initial image is essentially uniform, since the initial marginals \(\nu _i^0\) generated via (6.1) are, up to a constant factor, identical for all basic cells \(i \in I\) that arise from sub-dividing the same coarse basic cell \(j \in {\hat{I}}\). After the first iteration (on \(\mathcal {J}_A\)) a grid pattern re-emerges; after the second iteration (on \(\mathcal {J}_B\)) it is hard to spot further changes (cf. Fig. 10, where for \(\varepsilon =2\) the iterate after four iterations is shown).

The effect of \(\varepsilon \)-scaling within a given layer is visualized in Fig. 10. For \(\varepsilon =2\) the checkerboard pattern is blurred between neighbouring cells. For \(\varepsilon =0.25\), neighbouring cells overlap by at most one pixel, which would also be true in the discrete unregularized setting (recall that our pixels do not all carry the same mass, so the discrete unregularized optimal coupling is usually not deterministic). This demonstrates that the algorithm can be run down to very low regularization parameters.

Finally, Fig. 11 illustrates the evolution of the multi-scale scheme. It shows the final coupling on each layer for the smallest \(\varepsilon \) used on that layer, just before refinement (except for the finest layer). One can trace how increasingly fine structures of the optimal coupling emerge as the resolution improves.

7.3 Quantitative evaluation

Fig. 12 Average runtimes of the domain decomposition algorithm. The runtime increases approximately linearly (or slightly super-linearly) with the size of the marginal images, allowing the solution of large problems. The runtime of a single Sinkhorn algorithm (with a different stopping criterion) is shown for comparison (see Sect. 7.4 for details). The runtime decreases approximately in inverse proportion to the number of worker threads (up to five workers), indicating efficient parallelization (except for small images, where overhead is more significant and thus the effect of parallelization saturates earlier).

Fig. 13 Number of non-zero entries needed to store the partial marginal list \((\nu _i)_{i \in I}\) at the finest layer. The maximal number (attained for the largest value of \(\varepsilon \)) and the final number are shown. For comparison the number of non-zero entries in the truncated (stabilized) kernel matrix of the single Sinkhorn algorithm is shown (again, maximal and final number on the finest layer). All numbers scale linearly in the image size, indicating that the truncation schemes efficiently reduce the memory demand. For domain decomposition the final number of entries per pixel is approximately four, i.e. the coupling is rather concentrated and the effect of entropic blurring is weak. Maximal and final entry numbers are significantly lower with domain decomposition; the maximal number (which represents the memory bottleneck) is reduced by almost a factor of 10.

Having convinced ourselves by means of visualizations that the entropic domain decomposition algorithm seems to work as intended, we now turn to a more quantitative analysis of its performance. A comparison to the performance of a single sparse multi-scale Sinkhorn algorithm [23] is provided in Sect. 7.4.

Runtime We measure the runtime of the algorithm on different problem sizes, with the number of worker threads ranging from 1 to 5. For simplicity, even for a single worker thread, the implementation still uses two threads (one master, one worker), but its performance cannot be distinguished from a true single-thread implementation, since the master thread is essentially idle in this case. In addition we record the time required for solving the final (i.e. finest) layer and the times spent in the main sub-routines, i.e. the Sinkhorn solver, measure balancing, truncation and refinement. These are listed in Table 1 and the total runtimes are visualized in Fig. 12.

We find that the runtime scales approximately linearly (or at most slightly super-linearly) with the size of the marginals, which is crucial for solving large problems. The time required for solving the final layer accounts for approximately 80% of the total time, indicating that even with a good initialization, solving the finest layer is a challenging problem. The time spent in the Sinkhorn sub-solver for the composite cell problems accounts for approximately 96% of the runtime on large problems, which means that the other functions cause relatively little overhead. On small problems these ratios are somewhat lower, since the overhead of the multi-scale scheme is more significant.

For large problems the runtime is approximately inversely proportional to the number of workers (up to five workers), indicating that the parallelization is computationally efficient. Again, for small problems, computational overhead becomes more significant, so that the benefit of parallelization saturates earlier. For large problems we estimate that MPI communication accounts for approximately 3% of the time per domain decomposition iteration.

Sparsity In addition to runtime the memory footprint of the algorithm is an important criterion for practicability. It is dominated by storing the list \((\nu _i)_{i \in I}\) (after truncation at \(10^{-15}\), see Sect. 6.1), since only a few small sub-problems are solved simultaneously at any given time. We are therefore interested in the number of non-zero entries in this list throughout the algorithm. This number is usually maximal on the finest layer (most pixels) and for the highest value of \(\varepsilon \) on that layer (with strongest entropic blur), which therefore constitutes the memory bottleneck of the algorithm.
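For concreteness, a small bookkeeping sketch (ours, not from the reference implementation) of the quantity tracked here, assuming for illustration that the partial marginals are held as dense arrays:

```python
import numpy as np

def count_stored_entries(partial_marginals, threshold=1e-15):
    """Number of entries above the truncation threshold across the list
    (nu_i)_{i in I}; the maximal and final values of this count are the
    numbers reported in Fig. 13 and Table 1."""
    return sum(int(np.count_nonzero(np.asarray(nu_i) >= threshold))
               for nu_i in partial_marginals)
```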

This maximal number and the final number of entries are listed in Table 1 and visualized in Fig. 13. We find that both scale approximately linearly with the marginal size. Again, this is a prerequisite for solving large problems. At its peak the algorithm requires approximately 20 entries per pixel of the first image, and approximately 4 per pixel upon convergence. The latter indicates that the entropic blur is indeed strongly reduced in the final iterate.

Solution quality The data reported in Table 1 also confirms that the algorithm provides high-quality approximate solutions of the problem. The relative primal-dual (PD) gap is defined as (see (2.9) for the definition of J)

$$\begin{aligned} \frac{{{\,\mathrm{KL}\,}}(\pi |K)-J(u,v)}{{{\,\mathrm{KL}\,}}(\pi |K)-\Vert K\Vert }, \end{aligned}$$
(7.1)

i.e. we divide the primal-dual gap by the primal score (where, for simplicity, we subtract the constant term \(\Vert K\Vert \)). The dual variables \((u,v)\) are obtained from the composite duals by gluing, as discussed in Sect. 6.3. We find that this gap is on the order of \(10^{-4}\), i.e. the relative accuracy is high.
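A minimal sketch of how (7.1) can be evaluated (our reading, assuming that \(\Vert K\Vert \) denotes the total mass of the kernel and using dense arrays purely for illustration; \(J(u,v)\) is assumed to be computed elsewhere from the glued duals):

```python
import numpy as np

def kl_divergence(pi, K):
    """KL(pi | K) for non-negative arrays of the same shape, with the
    convention 0 log 0 = 0: sum pi log(pi/K) - sum pi + sum K."""
    support = pi > 0
    return (np.sum(pi[support] * np.log(pi[support] / K[support]))
            - pi.sum() + K.sum())

def relative_pd_gap(pi, K, dual_score):
    """Relative primal-dual gap in the spirit of (7.1);
    dual_score = J(u, v) for the glued dual variables."""
    primal = kl_divergence(pi, K)
    return (primal - dual_score) / (primal - K.sum())
```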

The \(L^1\)-error of the X-marginal of the primal iterate is consistently even smaller than the prescribed bound \(\text {Err}=10^{-4}\) (cf. Sect. 6.2). This is because we do not check the stopping criterion of the Sinkhorn sub-solver after every iteration. The \(L^1\)-error of the Y-marginal is not (numerically) zero, since a small error accumulates over time during the repeated additions and subtractions of the partial marginals and in the BalanceMeasures function. But since it remains substantially below the X-error we do not consider this to be a problem. Additional stabilization routines could be added to the algorithm, should this become necessary.
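For reference, the marginal errors discussed here amount to the following straightforward computation (a sketch with a dense coupling for illustration; the actual implementation never forms \(\pi \) as a full matrix):

```python
import numpy as np

def l1_marginal_errors(pi, mu, nu):
    """L1 errors of the marginals of a coupling pi (dense (m, n) array)
    with prescribed marginals mu (length m) and nu (length n)."""
    err_x = np.abs(pi.sum(axis=1) - mu).sum()   # X-marginal error
    err_y = np.abs(pi.sum(axis=0) - nu).sum()   # Y-marginal error
    return err_x, err_y
```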

7.4 Comparison to single Sinkhorn algorithm

As a reference we compare the performance of the domain decomposition algorithm to that of a single Sinkhorn algorithm (with adaptive kernel truncation, coarse-to-fine scheme and \(\varepsilon \)-scaling as in [23]). The results are also summarized in Table 1 and visualized in Figs. 12 and 13. The single Sinkhorn algorithm was limited to sizes up to \(512 \times 512\) due to numerical stability and memory issues (see below for details).

Both algorithms have several parameters that influence the trade-off between accuracy and runtime. One important parameter is the stopping criterion, applied to the Sinkhorn sub-solver in the former and to the global single Sinkhorn algorithm in the latter. Another one is the truncation threshold, applied to the partial marginals (cf. Sect. 6.1) in the former and to the (stabilized) kernel matrix in the latter [23, Section 3.3].

Stopping criterion The stopping criterion for the domain decomposition solver was picked such that the global \(L^1\)-error of the marginals is below \(\text {Err}=10^{-4}\) (cf. Sections 6.2 and 7.1). We observe that if we pick the same criterion for the single Sinkhorn solver, it takes much longer than the domain decomposition method (even without parallelization) and its runtime scales significantly super-linearly with the marginal size.

A more detailed analysis reveals that the single Sinkhorn algorithm spends a major part of its time on the largest \(\varepsilon \) on the finest layer, i.e. immediately after the last refinement. We conjecture that this is due to the fact that the single Sinkhorn algorithm represents the primal iterates only implicitly through the dual iterates \((u,v)\) via \((u \otimes v) \cdot K\) (cf. Sect. 2.2). During layer-refinement only the dual variables are refined explicitly. This can introduce rather non-local errors in the marginals of the primal iterate (i.e. mass must be moved over long distances to remove them), which take many iterations to correct. We tried various heuristics to reduce this effect (e.g. more sophisticated interpolation of the dual variables during refinement), but unfortunately without success.

In contrast, the domain decomposition method explicitly keeps track of the primal and dual (partial) iterates, both of which are refined separately during layer transitions. In particular the total mass of the refined primal iterate within each refined basic cell is correct after refinement (cf. (6.1) and Proposition 5) and thus fewer global mass movements are required to correct these errors.

Therefore, as in [23], we use the \(L^\infty \)-error as stopping criterion for the single Sinkhorn algorithm in our numerical experiments. We set the stopping accuracy to \(10^{-7}\), such that (without parallelization) both algorithms have approximately the same runtime.

Sparsity To perform its iterations the single Sinkhorn algorithm must keep track of a truncated (stabilized) kernel matrix, i.e. one column of non-zero entries for each pixel of the first marginal. Following [23, Section 3.3, Equation (3.8)] we keep entries of the (stabilized) kernel matrix where \((u \otimes v) \cdot k\ge \theta \) with the choice \(\theta =10^{-10}\).
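Schematically (our sketch; dense matrices purely for readability, whereas the scheme of [23] operates block-wise on the stabilized kernel), this truncation amounts to:

```python
import numpy as np
from scipy import sparse

def truncate_kernel(u, v, K, theta=1e-10):
    """Keep only those kernel entries whose corresponding entry of the
    primal iterate (u x v) * K is at least theta; store the rest sparsely."""
    plan = u[:, None] * K * v[None, :]
    return sparse.csr_matrix(np.where(plan >= theta, K, 0.0))
```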

The domain decomposition method (in the form of Algorithm 2) only needs to store a column of non-zero entries for each basic cell (of size \(s \times s\)) of the first marginal. Thus, to represent the primal iterate with comparable accuracy, the number of non-zero entries required by the domain decomposition method is reduced by a factor of \(s^{2}\).
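As a rough back-of-the-envelope illustration (our arithmetic, counting stored columns only, for the cell size \(s=4\) used in our experiments and a \(1024 \times 1024\) first marginal):

$$\begin{aligned} \underbrace{1024^2 \approx 1.05 \cdot 10^{6}}_{\text {columns for the single Sinkhorn algorithm}} \qquad \text {vs.} \qquad \underbrace{(1024/4)^2 = 65536}_{\text {columns for domain decomposition}}, \end{aligned}$$

i.e. the number of stored columns shrinks by the factor \(s^2=16\).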

In this sense, for the parameter choices in our experiments, the domain decomposition algorithm represents the primal iterates slightly more accurately. At the same time, the number of entries needed to store the final iterate is reduced to approximately 15% of that of the single Sinkhorn algorithm, and the maximal number of entries is even reduced to approximately 11%. On problems of size \(2048 \times 2048\) the single Sinkhorn algorithm ran out of memory on our test machine.

Solution quality As discussed above, to obtain similar runtimes for both algorithms, a less strict stopping criterion had to be chosen for the single Sinkhorn algorithm. Therefore it produces higher \(L^1\)-errors for the X-marginal, which increase with the problem size. The Y-marginal error is (numerically) zero, since the algorithm was terminated after a Y-iteration (cf. Remark 1). The relative dual score reported in Table 1 is defined by

$$\begin{aligned} \frac{J(u_{\text {single}},v_{\text {single}})-J(u_{\text {domdec}},v_{\text {domdec}})}{J(u_{\text {domdec}},v_{\text {domdec}})-\Vert K\Vert }, \end{aligned}$$
(7.2)

cf. (7.1), where \((u_{\text {domdec}},v_{\text {domdec}})\) refers to the (glued) final dual variables of the domain decomposition method (Sect. 6.3) and \((u_{\text {single}},v_{\text {single}})\) to the final dual iterates of the single Sinkhorn algorithm. The fact that this number is negative indicates that the domain decomposition method provides better dual candidates than the single Sinkhorn algorithm.

Numerical stability For small regularization parameters the standard Sinkhorn algorithm becomes numerically unstable due to the finite precision of floating-point arithmetic. This can be remedied to a large extent by the absorption technique [23]. But refinement between layers remains a delicate point, since the interpolated dual variables from the coarser layer may not provide a sufficiently good initial guess at the finer layer. This can lead to numerically zero or empty columns and rows in the (stabilized) kernel matrix and thus to invalid entries in the scalings \((u,v)\). While it is possible to fix these cases, on large problems this is relatively cumbersome and computationally inefficient. We therefore decided to abort the algorithm in these cases instead. In our experiments this concerned 3 of the 45 examples of size \(512 \times 512\) and the majority of the \(1024 \times 1024\) examples (which we have therefore removed from the comparison).

These issues may also arise in the domain decomposition algorithm, but there they are localized to the small problems on composite cells, where they are easy to fix (e.g. by temporarily increasing \(\varepsilon \)). We have implemented a corresponding safeguard. Across all our examples it was only invoked on two \(1024 \times 1024\) instances.

Summary The domain decomposition method is superior to the single Sinkhorn algorithm in several respects. Even without parallelization it produces more accurate primal and dual iterates in comparable time, using approximately 11% of the memory. It runs more reliably on large problems, since numerical hiccups of the Sinkhorn algorithm can be fixed locally. On top of that, the domain decomposition method allows for genuinely non-local parallelization, whereas the Sinkhorn algorithm can only be parallelized very locally, over the matrix-vector multiplications.

8 Conclusion and outlook

In this article we studied Benamou’s domain decomposition algorithm for optimal transport in the entropy regularized setting. The key observation is that the regularized version converges to the unique globally optimal solution under very mild assumptions. We proved linear convergence of the algorithm w.r.t. the KL-divergence and illustrated the (potentially very slow) rates with numerical examples.

We then argued that on ‘geometric problems’ the algorithm should converge much faster. To confirm this experimentally, we discussed a practical, efficient version of the algorithm with reduced memory footprint, numerical safeguards for approximation errors, primal-dual certificates and a coarse-to-fine scheme. With this we were able to compute optimal transport between images with \(\approx \) 4 megapixels on a standard desktop computer, matching two megapixel images in approximately four minutes. Even without parallelization, the algorithm compared favourably to a single Sinkhorn algorithm in terms of runtime, memory, solution quality and numerical reliability. With parallelization the runtime was efficiently reduced with little communication overhead.

A practical open question for future research is the development of more sophisticated and efficient implementations, e.g. using GPUs and/or multiple machines, with a more flexible parallelization structure (e.g. such that no single unit needs to keep track of the full problem). This should then be tested on 3D problems and with more general cost functions. In the course of this, parameters such as the basic cell size should be studied in more detail.

On the theoretical side, the convergence analysis on ‘geometric problems’ is an interesting challenge, which would provide a more thorough understanding of the low number of required iterations. We conjecture that a ‘discrete Helmholtz decomposition’ as discussed in Sect. 6.3 may become relevant for this. The relation between optimal transport and the Helmholtz decomposition was already noted in [4] and exploited numerically in [12].