Domain decomposition for entropy regularized optimal transport

We study Benamou's domain decomposition algorithm for optimal transport in the entropy regularized setting. The key observation is that the regularized variant converges to the globally optimal solution under very mild assumptions. We prove linear convergence of the algorithm with respect to the Kullback--Leibler divergence and illustrate the (potentially very slow) rates with numerical examples. On problems with sufficient geometric structure (such as Wasserstein distances between images) we expect much faster convergence. We then discuss important aspects of a computationally efficient implementation, such as adaptive sparsity, a coarse-to-fine scheme and parallelization, paving the way to numerically solving large-scale optimal transport problems. We demonstrate efficient numerical performance for computing the Wasserstein-2 distance between 2D images and observe that, even without parallelization, domain decomposition compares favorably to applying a single efficient implementation of the Sinkhorn algorithm in terms of runtime, memory and solution quality.


Motivation
(Computational) optimal transport. Optimal transport is a fundamental optimization problem with applications in various branches of mathematics. Let µ and ν be probability measures over spaces X and Y and let Π(µ, ν) be the set of transport plans, i.e. probability measures on X × Y with µ and ν as first and second marginal. Further, let c : X × Y → R be a cost function. The Kantorovich formulation of optimal transport is then given by inf ˆX ×Y c(x, y) dπ(x, y) π ∈ Π(µ, ν) . (1.1) We refer to the monographs [26] and [21] for a thorough introduction and historical context. Due to its geometric intuition and robustness it is becoming particularly popular in image and data analysis. Therefore, one main challenge is the development of efficient numerical methods, which has seen immense progress in recent years, such as solvers for the Monge-Ampère equation [3], semi-discrete methods [16,13], entropic regularization [9], and multi-scale methods [18,24]. An introduction to computational optimal transport, an overview on available efficient algorithms, and applications can be found in [19].
In [2] it is shown that this algorithm converges to the globally optimal solution when µ is Lebesgue absolutely continuous and (X 1 , X 2 , X 3 ) satisfy a 'convex overlap' condition, which roughly states that 'if a function is convex on X {1,2} and X {2,3} , then it must be convex on X'. Further, it is shown that this method can easily be extended to more than three cells, thus leading to a parallelizable algorithm. Unfortunately, it is also demonstrated that the discretized method does not converge to a global minimizer (for d > 1). When the discretization is refined, it is shown that the optimal solution is recovered in the limit. But this requires an increasing number of discretization points in the three sets (X 1 , X 2 , X 3 ) and thus numerically obtaining a good approximation of the globally optimal solution remains challenging with this method.

Outline and contribution
Entropic domain decomposition. The key observation in this article is that for entropic optimal transport the domain decomposition algorithm converges to the unique globally optimal solution for arbitrary (measurable) bounded costs c, as soon as µ(X 2 ) > 0. In particular, this covers the case where X and Y are discrete and finite. Therefore, entropic smoothing is not only a useful tool in solving the optimal transport sub-problems arising in domain decomposition, it also facilitates convergence of the whole algorithm. While the convergence rate can be very slow in suitable worst-case examples, we show empirically that it is fast on problems with sufficient geometric structure, thus leading to a practical and efficient parallel numerical method for largescale problems.
Preliminaries and definition of the algorithm. We start by establishing the basic mathematical setting and notation in Section 2.1. Some basic facts about entropic optimal transport are recalled in Section 2.2. In Section 3 the entropic domain decomposition algorithm is defined and a simple convergence proof is sketched.
Theoretical and empirical worst-case convergence rate. Section 4 is dedicated to proving that the iterates converge to the unique optimal solution linearly with respect to the Kullback-Leibler divergence. We first give a proof for the three-cell example discussed above (Section 4.1) and then generalize it to more general decompositions (Section 4.2). Our results apply to arbitrary (measurable) bounded cost functions c, marginal measures µ and ν, and we make minimal assumptions on the decomposition structure (it must be 'connected' in a suitable sense, Definition 7, in particular convergence only requires µ(X 2 ) > 0 in the three-cell example). As the entropic regularization parameter tends to zero, our rate bound tends to one exponentially. This is reminiscent of the convergence rate for the standard Sinkhorn algorithm established in [10] with respect to Hilbert's projective metric. While we do not expect that our bounds on the convergence rate are tight, we demonstrate in Section 4.3 with some numerical (near) worst-case examples that they capture the qualitative behaviour of the algorithm on difficult problems.
Relation to parallel sorting. The slow convergence rates from Section 4 do not suggest that the algorithm is efficient. But these slow rates rely on maliciously designed counter-examples. In practice usually much more geometric structure is available, similar to the setting originally considered by Benamou [2]. While a detailed analysis of the convergence rate in these cases is beyond the scope of this article, to provide some intuition, we discuss in Section 5 the relation of the domain decomposition algorithm for optimal transport with the odd-even transposition parallel sorting algorithm [14, Exercise 37] in one dimension. This algorithm is known to converge in O(N ) iterations where N is the number of cells that the domain is partitioned into.
Practical implementation and geometric large-scale examples. In Section 6 we provide a practical version of the domain decomposition algorithm, leading to an efficient numerical method for large scale problems with geometric structure. We discuss how the memory footprint can be reduced, how one can handle approximate solutions to the sub-problems as obtained by the Sinkhorn algorithm, and how to combine the algorithm with the ε-scaling heuristic and a multi-scale scheme, see [23].
Numerical experiments and comparison with single Sinkhorn algorithm. The efficiency of the implementation is then demonstrated numerically by computing the Wasserstein-2 distance between images in Section 7. We report accurate approximate primal and dual solutions with low entropic regularization artifacts after a logarithmic number (w.r.t. the marginal size) of domain decomposition iterations. On a standard desktop computer with 6 cores a high-quality approximate optimal transport between two mega-pixel images is computed in approximately 4 minutes.
A comparison to a single Sinkhorn algorithm, when applied to the full problem, is given in Section 7.4. We find that the domain decomposition approach has several key advantages. First, it allows more extensive parallelization. The standard Sinkhorn algorithm is already parallel in the sense that each half-iteration consists essentially of a matrix-vector product that can be parallelized. But the results of this operation must be communicated to all workers after each half-iteration, thus only allowing local parallelization such as a single GPU. In the domain decomposition variant, small sub-problems are solved independently and their results must only be communicated upon completion, thus allowing parallelization over multiple machines.
Second, domain decomposition allows to reduce the memory footprint. The naive Sinkhorn algorithm requires storage of the full kernel matrix, the size of which grows quadratically with the marginal size. This can be avoided by using efficient heuristics, such as Gaussian convolutions or pre-factored heat kernels [25] but these methods only work in particular settings and with sufficiently high regularization. Alternatively, with adaptive kernel truncation [23] the memory demand is approximately linear in the marginal size. But to retain numerical stability, the truncation parameter may not be chosen too aggressively, thus still making memory demand a practical constraint. In the domain decomposition variant, only the kernel matrices for the sub-problems that are currently being re-solved are needed; for all other sub-problems only information about their marginals is kept. In addition, via parallelization this may also be distributed over several computers.
In our numerical experiments we find that in comparable runtime (even without parallelization) the domain decomposition algorithm obtains more accurate primal and dual iterates by requiring only approximately 11% of the memory of the single Sinkhorn solver. Dealing only with small sub-problems at a time also makes it easier to handle numerical delicacies associated with small entropic regularization parameters. The domain decomposition method therefore more reliably solves large problems.
2 Background 2.1 Setting and notation • X and Y are compact metric spaces. We assume compactness to avoid overly technical arguments while covering the numerically relevant setting.
• For a compact metric space Z the set of finite signed Radon measures over Z is denoted by M(Z). The subsets of non-negative and probability measures are denoted by M + (Z) and M 1 (Z). The Radon norm of µ ∈ M(Z) is denoted by µ M(Z) and one has µ M(Z) = µ(Z) for µ ∈ M + (Z). We often simply write µ for the norm when the meaning is clear from context.
• For µ ∈ M + (Z) we denote by L 1 (Z, µ), L ∞ (Z, µ) the usual function spaces and add a subscript + to denote the subsets of µ-a.e. non-negative functions. The corresponding norms are denoted by · L 1 (Z,µ) and · L ∞ (Z,µ) but we often merely write · 1 and · ∞ (or even · for the latter) when space and measure are clear from context.
• For µ ∈ M + (Z) and a measurable function u : Z → R we denote by µ, u the integration of u with respect to µ, and define the normalized integral • For µ ∈ M + (Z) and measurable S ⊂ Z the restriction of µ to S is denoted by µ S. Similarly, for a measurable function u : Z → R, set u S(x) = u(x) if x ∈ S, and 0 otherwise.
• The maps P X : M + (X × Y ) → M + (X) and P Y : M + (X × Y ) → M + (Y ) denote the projections of measures on X × Y to their marginals, i.e.

Entropic optimal transport
We recall in this Section the entropic regularization approach to optimal transport and collect the main existence and characterization results which will be needed in this article.
Definition 2 (Entropy regularized optimal transport). Let µ ∈ M 1 (X), ν ∈ M 1 (Y ) and µ ∈ M + (X) andν ∈ M + (Y ) such that Similar to above, denote by Π(μ,ν) := {π ∈ M + (X × Y ) | P X π =μ, P Y π =ν} the set of transport plans betweenμ andν. The condition μ = ν ensures that the set is non-empty. For a lower-semicontinuous cost function c ∈ L ∞ + (X × Y, µ ⊗ ν) the Kantorovich optimal transport problem betweenμ andν is then given by Existence of minimizers in this setting is provided, for instance, by [26,Theorem 4.1]. For a regularization parameter ε > 0 define In this article we will frequently exploit that k is bounded by 0 < exp(− c ∞ )/ε) ≤ k ≤ 1. Then the entropic optimal transport problem, regularized with respect to µ ⊗ ν is given by Problem (2.4) can be solved (approximately) with the Sinkhorn algorithm (Remark 1). In addition, in our article, entropic smoothing will ensure convergence of the domain decomposition scheme. We refer to [19,Chapter 4] for an overview on entropic optimal transport and its numerical implications. The (Γ-)convergence of (2.4) to (2.2) as ε → 0 has been shown in [7,15,6] in various settings and with different strategies.
We now collect some results about entropic optimal transport required in this article. A proof is given in the appendix.
Definition 3 (Dual problem). In addition to (2.4), we will consider a corresponding dual problem which, for the purpose of this article, is best stated as This is a reparametrization of the usual dual problem where one uses ε log u and ε log v as variables instead (cf. [23]). Again, we briefly collect a few helpful results. A proof is given in the appendix.
We refer to these two steps as Y -and X-iteration. It is quickly verified that a Y -iteration corresponds to a partial optimization of J, (2.9), over v for fixed u. Conversely, an X-iteration is obtained by optimizing over u for fixed v. Hence the sequence (u ( ) , v ( ) ) is a maximizing sequence of (2.9). Complementarily, the sequence of measures π ( ) := (u ( ) ⊗ v ( ) ) · K converges (under suitable assumptions) to the solution of (2.4). In general one has P Y π ( ) = ν but P X π ( ) = µ. Conversely, for π ( +1/2) := (u ( +1) ⊗ v ( ) ) · K one has P X π ( +1/2) = µ but P Y π ( +1/2) = ν. Again, for a thorough overview we refer to [19,Chapter 4]. A computationally efficient implementation is discussed in [23], which we use as a reference in this article.

Entropic domain decomposition algorithm
Throughout this article we will be concerned with solving . By Proposition 1, (3.1) has a unique minimizer π * that can be represented as π * = (u * ⊗ v * ) · K for corresponding scaling factors u * , v * .
Following Benamou [2] (see Section 1.1), the strategy of the algorithm is as follows: Divide the space X into two partitions J A and J B (which imply two partitions of the product space X × Y ) that need to 'overlap' in a suitable sense. Then, starting from an initial feasible coupling π 0 ∈ Π(µ, ν), optimize the coupling on each of the cells of J A separately, while keeping the marginals on that cell fixed. This can be done independently and in parallel for each cell and the coupling attains a potentially better score while remaining feasible. Then repeat this step on partition J B , then again on J A and so on, continuing to alternate between the two partitions.
To construct J A and J B we first define a basic partition of small cells. The two overlapping partitions are then created by suitable merging of the basic cells. This induces a graph over the basic partition cells as vertices. The algorithm converges when this graph is connected, see Section 4 for details.
Definition 4 (Basic and composite partitions). A partition of X into measurable sets {X i } i∈I , for some finite index set I, is called a basic partition of (X, µ) if the measures µ i := µ X i for i ∈ I satisfy µ i > 0. By construction one has i∈I µ i = µ. In addition we will write K i := K (X i × Y ) for i ∈ I. Often we will refer to a basic partition merely by the index set I.
For a basic partition {X i } i∈I of (X, µ) a composite partition J is a partition of I. For J ∈ J we will use the following notation: Throughout the article we will use one basic partition I and two corresponding composite partitions J A and J B . A formal statement of the algorithm is given in Algorithm 1.

Example 1.
In the three-cell example from Section 1.1, the basic partition is given by {X 1 , X 2 , X 3 }, I = {1, 2, 3} and the two composite partitions are given by J A = {{1, 2}, {3}} and J B = {{1}, {2, 3}}. For odd, i.e. during an J A -iteration, the optimization on the 'single cell' {3} can be skipped since it is made redundant by the optimization on {2, 3} in the subsequent J B -iteration: One has P Y π ( ) (X 3 × Y )) = P Y π ( −1) (X 3 × Y )) and thus, when computing the partial marginal for cell {2, 3} in the next iteration one finds The same holds for the cell {1} during the J B -iteration. We then obtain the method described in the introduction.
Definition 5 (Notation of iterates). For ≥ 1 the following sequences will be used throughout the rest of this article: (ii) The ratios of the u ( ) J -scaling factor are bounded, (iv) The product of the scaling factors u for µ i -a.e. x ∈ X i and ν-a.e. y ∈ Y .
(v) For any j ∈ J, one has To provide some intuition, we now sketch a simple proof that Algorithm 1 converges to the globally optimal solution. For simplicity, we restrict it to the three-cell problem and discrete (and finite) spaces X and Y . While it could be extended to the general setting, we instead refer to Section 4. Sketch. When X and Y are discrete, the spaces of measures and measurable functions all become finite-dimensional real vector spaces. For simplicity, we may assume that µ and ν assign strictly positive mass to every point in X and Y . Then K assigns strictly positive mass to every point in X × Y and therefore π K for any π ∈ M + (X × Y ). The function KL(·|K) is therefore finite and continuous on M + (X × Y ). Denote by S the 'solving' map M + (X × Y ) π → arg min{ε KL(π|K)|π ∈ Π(P X π, P Y π)} which is well-defined.
We show that S is continuous: Let (π n ) n be a sequence of measures in M + (X ×Y ) converging to π ∞ (this implies that the entries of π n are uniformly bounded), then S(π n ) = u n ⊗ v n · K for suitable u n , v n by Proposition 1. By re-scaling, with (2.6,2.7) we may assume that u n and v n are uniformly bounded and by compactness in finite dimensions, we may extract some cluster points u ∞ and v ∞ . Since u n ⊗v n ·K ∈ Π(P X π n , P Y π n ), one finds that u ∞ ⊗v ∞ ·K ∈ Π(P X π ∞ , P Y π ∞ )}, which must be equal to S(π ∞ ) by Proposition 2 (iii). Therefore, S(π n ) converges to S(π ∞ ) and S is continuous.
Let now F A be the map that takes an iterate π ( −1) to π ( ) when is odd, and F B the map for even. In the discrete setting, the restriction of a measure to a subset is a continuous map, and since F A and F B are built from measure restriction and the solving map S, they are continuous.

Linear convergence
In this Section we prove linear convergence of Algorithm 1 to the globally optimal solution of (3.1) with respect to the Kullback-Leibler divergence. We start with the proof for the three-cell setup in Section 4.1 and finally the proof for general decompositions in Section 4.2. Numerical worst-case examples that demonstrate the qualitative accuracy of our convergence rate bounds are presented in Section 4.3.
Remark 2 (Proof strategy). The basic strategy for the proofs is inspired by [17] where linear convergence for block coordinate descent methods is shown. Indeed, the domain decomposition algorithm can be interpreted as a block coordinate descent for the entropic optimal transport problem, where in each iteration we add some additional artificial constraints that fix the Ymarginals of the current candidate coupling on each of the composite cells. The fundamental strategy of [17] is to find a bound ∆(π ( ) ) ≤ C ∆(π ( −k) ) − ∆(π ( ) ) for some C ∈ R + where k = 1 in the three-cell setting and k > 1 in the general setting. This then quickly implies linear convergence of ∆(π ( ) ) to 0. In this article, we exploit the identity ∆(π ( −k) ) − ∆(π ( ) ) = KL(π ( −k) |π ( ) ) (Lemma 3) and we use a primal-dual gap estimate (Proposition 2 (ii)) to bound ∆(π ( ) ) ≤ KL(π ( ) |π) for a suitableπ of the formû ⊗v · K (Lemma 4) and then use carefully chosenû andv to control the latter by the former. A major challenge is that Algorithm 1 does not produce any 'full' dual iterates. We may only use the dual variables for the composite cell problems in line 8, but they are not necessarily consistent across multiple cells. Inspired by the proof of Proposition 4 we propose a way to approximately 'glue' them together to obtain a global candidate.

Three cells
We first state the main result of this Section. Theorem 1 (Linear convergence for three-cell decomposition). Let (X 1 , X 2 , X 3 ), I = {1, 2, 3}, be a basic partition of (X, µ) into three cells and set Then, for iterates of Algorithm 1 one has for > 1, In particular, the domain decomposition algorithm converges to the optimal solution.
The proof for Theorem 1 requires some some auxiliary Lemmas. Most of the Lemmas are already formulated for more general partitions to allow reusing them in the next Section. The Lemmas strongly rely on algebraic properties of the KL divergence that are used in a similar way throughout the literature, see for example [1,Lemma 2]. Then We briefly recall from Section 2.1 at this point that ρ, f denotes integration of the measurable function f against the measure ρ.
Proof. Fix > 1. With the notation of Definition 5 we can decompose the current and previous iterate as with all partial scalings being essentially bounded with respect to µ or ν. So, by using Lemma 2, we compute Note now that, for any J ∈ J ( ) , the Y -marginal on each composite cell is kept fixed during the iteration (see Algorithm 1, lines 7-8), i.e., Thus, we can continue: The proof of the second statement follows the same steps. It hinges on the fact that P X π Proof. If KL(π|π) = +∞ there is nothing to prove. Therefore, assume KL(π|π) < +∞. Abbre- Here the integral in the first line is finite since KL(π i |π i ) ≤ KL(π|π) < ∞. The first integral in the second line is finite since log (ũ i ⊗ṽ i ) ∈ L 1 (X × Y,π i ), arguing as in Proposition 1 (iv), and therefore so is the second. Since all scaling factors are essentially bounded, we can split the products within the logarithms into separate integrals in the third line where we also used π ∈ Π(µ, ν). In the fourth line we use the definition of J, (2.9). The last inequality is due to Proposition 2, since for the primal minimizer π * we find KL(π * |K) ≥ J(û,v) and the primal-dual gap is thus a bound for the suboptimality (4.1).
Theorem 1. Assume first that > 1 is odd, and so the current composite partition is On one hand, using Lemma 3, one finds On the other hand, with Lemma 4, one can bound ∆(π ( ) ) ≤ KL(π ( ) |π) for a suitableπ of the formπ = (û ⊗v) · K. Using a piecewise definition on the partition cells forû, we set herê where the factor q ∈ R ++ is to be determined later (observe thatû andv are upper bounded, thanks to Lemma 1). With this choice ofπ we find that π ( ) To complete the proof we must now find a constant C ∈ R + such that To this end we write Now apply (3.3) at iterate ( − 1) and for J = {2, 3} to obtain that for µ-a.e. x ∈ X 2 , x 3 ∈ X 3 and for ν-a.e. y ∈ Y it holds Further, note that for a ≥ 0 the map Φ : s → ϕ a s · s is convex on R ++ and thus by Jensen's inequality one has for ν where we now fix the value q := ffl , cf. Lemma 1 (i,iii)). We recall the notation for the normalized integral from (2.1). Plugging first (4.10) and then (4.11) into (4.9) one obtains for µ-almost all Now we work backwards from (4.8). With the specified choices forû andv made in (4.7), one and with (4.6) and (4.8) For even one proceed analogously to obtain Hence, for any > 0, using which implies (4.2). Eventually, the sequence (π ( ) ) is bounded and each element lies in Π(µ, ν).

General decompositions
Now let I be a basic partition of (X, µ) into N basic cells and let J A and J B be two composite partitions of I. In the special case of three cells discussed in the previous Section, convergence of the algorithm was driven by the overlap of the two partitions on the middle cell. In the general case this overlap structure will be captured by the partition graph and related auxiliary objects that we introduce now.
Definition 7 (Partition graph, cell distance and shortest paths).
(i) The partition graph is given by the vertex set I (i.e. each basic cell is represented by one vertex) and the edge set i.e. there is an edge between two basic cells if they are part of the same composite cell in either of the two composite partitions J A or J B .
(ii) We assume that no two basic cells are simultaneously contained in a composite cell of (otherwise i and j should be merged into one basic cell). This means that every edge in the partition graph is associated to precisely one of the two composite partitions.
(iii) We assume that the partition graph is connected.
(iv) We denote by dist : I × I → N 0 the discrete metric on this graph induced by shortest paths, where each edge has length 1.
(v) Let J 0 ∈ J A be a selected composite cell. We define the cell distance (vi) For each i ∈ I we select one shortest path in the graph from J 0 to i, which we represent by a tuple (n i,k ) k=0 of elements in I with n i,0 ∈ J 0 , n i,D(i) = i, and D(n i,k ) = k for k ∈ {0, . . . , D(i)}. We refer to Fig. 2 for an illustration of such a construction. One can easily verify that, within any shortest path, the basic cells n i,k and n i,k−1 belong to a common composite cell of partition J A when k is even, and J B when k is odd, or equivalently, of J ( −k) for some odd > M , see Fig. 3.
Let us fix for the rest of this Section an odd iterate so that J ( ) = J A (the same will then apply for even by swapping the roles of J A and J B ).
where µ min := min{ µ i |i ∈ I}. In particular, the domain decomposition algorithm converges to the optimal solution. Remark 3 (Proof strategy). For the three-cell case the proof relied on the fact that π ( ) 3 remained unchanged in an A-iteration, since {3} ∈ J A was a composite cell containing only basic cell 3. This is no longer true for general decompositions. To overcome this, we need to replace parts of the primal iterate π ( ) by partial transport plans from previous iterations (Lemma 5). When bounding the sub-optimality by the decrement (cf. Remark 2) we also need an approximate triangle inequality for the KL-divergence (Lemma 7).
Proof. In the proof we assume that is odd (i.e. an A-iteration was just completed). The proof for even works completely analogous by swapping the roles of J A and J B .
We start with the claim thatπ ∈ Π(µ, ν). Sinceπ is a sum of non-negative measures it is non-negative. By construction of the algorithm one always has P X π (m) i = µ i for all iterations m ∈ N 0 and thus For k ∈ {0, . . . , M } we introduce the auxiliary measures Assume now that k is odd. Recalling that is assumed to be odd, then the set J k := {i ∈ I : D(i) ≥ k} is a union of A-cells and the iteration step from π ( −k) to π ( −k+1) is an iteration on A-cells (since − k + 1 is odd, recall also Definition 7), i.e., for any J ∈ J A one has Since J k is a union of A-cells we find therefore Using this, we find that P Yπ k = P Yπ k−1 for k ∈ {1, . . . , M } and k odd, and we can argue in complete analogy for even k via B-cells. Consequently, and thusπ ∈ Π(µ, ν).
Now we establish the KL bound. Again, assume that k is odd, so that J k is a union of A-cells and the iteration step from π ( −k) to π ( −k+1) is an iteration on A-cells. Then for any J ∈ J A one finds and since J k is a union of A-cells one gets that Again, an analogous argument holds when k is even. Consequently one has Lemma 6 (Global density bound). Let > M . Then for ν-a.e. y ∈ Y and all i ∈ I one has for ν-a.e. y ∈ Y . Applying this bound recursively, the max runs over increasingly many cells and after M + 1 applications of the bound we find: Observe now that j∈I dν ( −M −1) j dν (y) = 1 and thus the maximum must be bounded from below by 1/N .
where the factors q i,k are defined as and are finite thanks to the lower and upper boundedness of u-scalings, cf. Lemma 1, (i,iii). Thus, with Lemmas 3, 4 andπ from Lemma 5, one obtains where one has For i ∈ J 0 one findsπ i almost everywhere that . To control KL(π i |π i ) for i ∈ I with D(i) > 0 we will use the following bound for k ∈ {0, . . . , D(i) − 1}: for µ n i,k -a.e.x ∈ X n i,k , where in the first inequality we have used (3.3) and the boundedness of c, and in the second inequality we have used Jensen's inequality as in (4.11) in the proof of Theorem 1. Now note that one has v ( −k) n i,k−1 for k ∈ {1, . . . , D(i)}, since {n i,k , n i,k−1 } ∈ J for some J ∈ J ( −k) , cf. Def. 7 (vi) and Fig. 3. Therefore, for i with D(i) > 0 and recalling n i,D(i) = i, one obtainsπ i -almost everywhere that (4.20) So continuing the suboptimality bound from (4.18) we obtain ∆(π ( ) ) ≤ i∈I: In the following we now transform the 'ϕ of product' into a 'sum of ϕ' via Lemma 7. For each term we then use several auxiliary results to boundπ i from below by the appropriate π Using the definition of q i,k in (4.17) and (3.5) we find that and so by using Lemma 7 one obtains for and therefore, applying iteratively this bound along the chain connecting X i to X n i,0 , we obtain for µ i -a.e. x ∈ X i and µ n i,0 -a.e. x ∈ X n i,0 Combining successively the upper bound in (3.4), (4.15) and the lower bound in (3.4) one deduces for (µ n i,0 ⊗ µ n i,k )-a.e. (x, x ) ∈ X n i,0 × X n i,k and ν-a.e. y ∈ Y that and we set Eventually, applying again (3.3), we can easily see that We now have all the ingredients to go back to (4.21). For every fixed i ∈ I withπ i =û i ⊗v i ·K i and 0 ≤ k ≤ D(i) − 1, and for µ n i,0 -a.e. x ∈ X n i,0 and µ n i,k -a.e. x ∈ X n i,k , we computê This eventually yields where we used in the second line that D(j) = k if j = n i,k for j ∈ I (since n i,k is at the k-th step of the shortest path to cell i). This yields the sought-after bound: Convergence of the iterates to the optimal solution follows as in Theorem 1.

Numerical worst-case examples
In the preceding Sections we derived bounds on the convergence rate based on various worst-case estimates. But convergence might actually be much faster. In this Section we provide numerical examples that demonstrate that, while the pre-factors in our bounds might not be tight, in the three-cell-case the qualitative dependency of the convergence rate on the parameters ε and the mass in the middle cell µ(X 2 ) are accurate; and that convergence slows down in the general case as the maximum distance M from the root cell J 0 in the partition graph increases.
In the unregularized case π (0) would be a fixed-point of the iterations, since π (0) is partially optimal on X {1,2} and X {2,3} . This provides a simple example that the unregularized domain decomposition algorithm requires additional assumptions for convergence to the global solution.
In the regularized setting, some mass will also be put onto the points (1, 2), (2, 1), (2, 3), and (3, 2) despite the high cost, thus leading to the possibility to move mass between X 1 to X 3 via X 2 and thus eventually solving the problem. But this mass tends to zero exponentially as ε → 0 thus resulting in an exponential number of iterations.
Numerical results are summarized in Figure 5. To estimate ∆(π ( ) ) we obtain a high precision approximate solution π * by applying a single Sinkhorn algorithm to the problem. Since this can directly exchange mass between cells 1 and 3, for the single solver this problem is not particularly difficult (the same applies to the other examples in this Section). Let λ(ε) be the empirical contraction factor for the sub-optimality. Theorem 1 yields the bound As indicated in Figure 5, the 1/ε-dependency accurately describes the convergence behaviour.
Example 3 (Three cells, dependency on q). We revisit the previous Example 2, but now we fix the regularization parameter ε = 10 and vary the mass q in the middle cell. Intuitively, Right: We extract the linear contraction factor λ(ε) from the left plot and compare it to the theoretical bound from Theorem 1. This is visualized best in the form − log(λ(ε) −1 − 1) for which the bound yields 2 c ε − log(q/(1 − q)), see (4.25). In particular, the 1/ε-behaviour is accurate and the pre-factor appears to be off only by a factor of ≈ 2.
far from the optimal solution, the mass exchanged between the cells in each iteration should be proportional to q and thus for small q the number of required iterations should be approximately proportional to q −1 . This would imply that for small q the contraction ratio λ(q) scales as exp(−C q) ≈ 1 − C q for some constant C. For small q this matches the behaviour of the bound of Theorem 1, which implies, cf. (4.25), Numerical results are summarized in Figure 6. In our setting the proportionality between λ(q) −1 − 1 and q is confirmed numerically. Note that we chose ε = 10 relatively high. For smaller values we observed that the mass exchanged between the cells in each iterations was no longer proportional to q and thus the interplay between the parameters ε and q became more complicated.
Example 4 (Chain graph, dependency on graph size). The dependency of the contraction ratio bound of Theorem 2 on the graph structure is more complex. The graph structure determines the numbers M , N and implicitly µ min (since mass must be distributed over increasingly many cells). Due to the variety of worst-case estimates that entered into the proof of (4.13) we do not expect it to be particularly tight. Nevertheless, it is easy to confirm numerically that convergence slows down when the graph size increases.  Right: We extract the linear contraction factor λ(q) from the left plot and compare it to the theoretical bound from Theorem 1. This is visualized best in the form λ(q) −1 − 1 for which the bound yields exp(−2 c /ε) · q 1−q , see (4.26). In particular, the proportionality to q (for small values) is accurate. The pre-factor appears to be off by a factor of ≈ 50.
The unique optimal coupling is given by π 1) . c and π (0) are illustrated in Figure 4. For N even, we set  Figure 7. We find that λ(N ) approaches 1 as N increases.

Relation to parallel sorting
The linear rates obtained in Theorems 1 and 2 become very slow for small ε. The number of required iterations to achieve prescribed sub-optimality scales like exp(C/ε) as ε approaches zero. This rate is comparable to the result by Franklin and Lorenz [10] on the convergence of the discrete Sinkhorn algorithm. They show that the intermediate marginal µ ( ) = P X π ( ) in the Sinkhorn algorithm satisfies d(µ ( ) , µ) ≤ q · d(µ (0) , µ) where d denotes Hilbert's projective metric and q ≈ 1 − 4 exp(− c /ε) for ε c . While we do not claim that the bounds in Theorems 1 and 2 are tight, they at least qualitatively capture the right scaling behaviour with respect to ε, the mass contained in the basic cells, and the (approximate) graph diameter M , as demonstrated with some worst-case examples in Section 4.3.
We emphasize however that in Section 4 we make virtually no assumptions on the cost function (beyond bounded) and on the structure of the basic and composite partitions (beyond basic cells carrying non-zero mass and a connected partition graph). This freedom is essential for the construction of the examples in Section 4.3. Conversely, in practice we are usually not dealing with such restive problems, but often need to solve transport problems on lowdimensional domains, with highly structured cost functions, and it is possible to choose highly structured basic and composite partitions. Based on the intuition from Figure 1 we expect that the algorithm will perform much better in such settings, which is confirmed by numerical experiments in Section 7. Unfortunately, a convergence analysis that leverages the additional geometric structure to obtain better rates is much more challenging than the worst-case bounds from Section 4 and thus beyond the scope of this article.
To provide at least some insight we discuss in this Section an example of discrete, onedimensional, unregularized optimal transport where the domain decomposition algorithm becomes equivalent to the parallel odd-even transposition sorting algorithm. Let for some N ∈ N. Let µ and ν be the normalized counting measures on X and Y , which can be written as and let c : X × Y → R + satisfy the strict Monge property [5], i.e.
A simple example for such a setting would be when X and Y are (sorted) sets of points in R, x i < x j , y i < y j for 1 ≤ i < j ≤ N , and c(x i , y j ) = h(x i − y j ) for a strictly convex function h. This covers the Wasserstein distances for p > 1.
For this setting we are now interested in the unregularized optimal transport problem min ˆX ×Y c dπ π ∈ Π(µ, ν) .

(5.2)
It is easy to see that this is equivalent to solving the linear assignment problem between the sets X and Y with respect to the cost c and that the unique optimal solution is given by π = 1 N N i=1 δ (x i ,y i ) (uniqueness is implied by strict inequality in (5.1)). Let now a basic partition of X be given by the partition into singletons, i.e. X i = {x i } for i ∈ I = {1, . . . , N } and the two composite partitions J A and J B will be set to 2}, {3, 4}, . . . , {N − 1, N }), J B = ({1}, {2, 3}, . . . , {N − 2, N − 1}, {N }) (5.3) where for simplicity of exposition we assume that N is even. Further, let π (0) = 1 where σ : {1, . . . , N } → {1, . . . , N } is some permutation. We can now conceive running an unregularized version of Algorithm 1 where we replace the regularized transport solver in line 8 by the unregularized counterpart. It can quickly be seen that (due to the strict Monge property) the solution of the (now unregularized) sub-problems in line 8 is always unique and at each iteration π ( ) corresponds to a permutation σ ( ) . Each composite cell (with the exception of the cells {1} and {N } in J B ) consists of two consecutive elements of X and in line 8 it is tested whether the two corresponding assigned points in Y should be kept in the current order or flipped. Therefore, in this special case the domain decomposition algorithm reduces to the odd-even transposition parallel sorting algorithm [14,Exercise 37] which is known to converge in O(N ) iterations. It is not hard to see that the analysis for odd-even transposition sort could be adapted to the continuous one-dimensional strictly-Monge case to establish convergence in O(1/µ min ) steps.
Unfortunately, the techniques do not generalize to multiple dimensions and an accurate analysis of the convergence speed of the domain decomposition algorithm in that setting is therefore still an open problem. Nevertheless, we are optimistic that the expected performance for 'structured problems' in higher dimensions will be much better than the worst-case bounds from Section 4, which is confirmed numerically in Section 7.

.1 Computational adaptations
Our strongest motivation for studying the entropic domain decomposition algorithm is to develop an efficient numerical method. The statement given in Algorithm 1 is mathematically abstract and served as preparation for theoretical analysis. In this Section we describe a more concrete computational implementation, formalized in Algorithm 2. We summarize the main modifications: (i) Instead of storing the full coupling π ( ) , we only store the Y -marginal ν (ii) To further reduce the memory footprint, we truncate ν ( ) i after each iteration, i.e. we only keep a sparse representation of the entries above a small threshold, which we set to 10 −15 . In theory this invalidates the global density bound (4.15) in Lemma 6 and thus jeopardizes convergence of the method. In practice our threshold is low enough such that ν ( ) i on neighbouring basic cells have strong overlap and thus, based on the intuition from Section 5, on geometric problems we still expect convergence to an approximate solution. This is can be confirmed numerically via the primal-dual gap (Section 7.3 and Table 1).

Algorithm 2 Practical implementation of entropic domain decomposition
Input: • initial feasible basic Y -marginals (ν 0 i ) i∈I (i.e. satisfying ν 0 i = µ i for i ∈ I and i∈I ν 0 i = ν) • scaling factor u 0 ∈ L ∞ + (X, µ) Output: if ( is odd) then C ← A else C ← B // select the partition 8: for all J ∈ J C do // solve over each composite cell 9: ν cell ← i∈J ν i // compute Y -marginal on cell 10: π cell , u C,J ← Sinkhorn(µ J , ν cell , u C,J ) // approx. sol. w. Sinkhorn algorithm 11: for all i ∈ J do: ν i ← P Y (π cell (X i × Y )) // extract new basic partial marginals 12: BalanceMeasures((ν i ) i∈J )) 13: for all i ∈ J do: Truncate(ν i ) 14: end for 15: until stopping criterion 16: The functions Sinkhorn and BalanceMeasures are described in Section 6.2. The function Truncate refers to truncation up to a small threshold to obtain sparse vectors, as described in Section 6.1, item (ii). The variable π cell is a local variable that is not stored beyond the for-loop over J C . This means, we need to be able to handle approximate solutions for the cell problems. We describe how this is done in Section 6.2.
(iv) We explicitly keep track of the partial scaling factors u on the composite cells of J A and J B . This allows accurate initialization of the Sinkhorn algorithm on each composite cell, in particular after the first few iterations. Further, it enables us to quickly reconstruct the partial couplings on composite and basic cells. Finally, we can use them to estimate the sub-optimality via the primal-dual gap, see Section 6.3.
(v) We combine Algorithm 2 with a multi-scale scheme and the ε-scaling heuristic described in [23]. This is critical for computational efficiency, as it allows to obtain a good (and sparse) initial coupling π 0 (or its partial Y -marginals on the basic cells) and drastically reduces the number of required iterations. Some more details are given in Section 6.4.

Sinkhorn solver and handling approximate cell solutions
The Sinkhorn solver at line 10 in Algorithm 2 only returns an approximate solution. This requires some additional post-processing. Note that for efficiency, we initialize the solver with the scaling factor u C,J from the previous iteration, i.e. we start the solver with a Y -iteration (Remark 1). As stopping criterion of the Sinkhorn solver we use the L 1 -marginal error of the X-marginal after the Y -iteration. We require that this error is smaller than µ J · Err where Err is some global error threshold. In this way, the summed L 1 -error of all X-marginals over all composite cells is bounded by Err after each iteration.
We terminate the algorithm after a Y -iteration, i.e. the Y -marginals are satisfied exactly, P Y π cell = ν cell , and after the for-loop in line 14 one has i∈I ν i = ν, which is crucial for the validity of the domain decomposition scheme. Since µ J = µ X J is constant throughout the algorithm it is easier to handle a deviation between P X π cell and µ J .
However, when π cell is not contained in Π(µ J , ν cell ), after computing the basic cell marginals in line 11 it may occur that ν i = µ i . This means, that while i∈J ν i = ν cell , the masses of the ν i may not exactly match the masses in their basic cells. This would cause problems in subsequent iterations. Therefore, the function BalanceMeasures determines for each basic cell i ∈ J the deviation ν i − µ i and then moves mass between the different (ν i ) i∈J to compensate the deviations, while preserving their non-negativity and their sum. This can be done efficiently directly in the sparse data structure and without adding new support points.

Estimating the primal-dual gap
Explicitly evaluating the primal-dual gap between (2.4) and (2.8) (see Proposition 2 (ii)) is a useful tool to verify that a high-quality approximate solution was found. As primal candidate we can use the current primal iterate. We can sum up the contribution of each π cell to (2.4) in the for-loop starting at line 8. The partial replacement with older iterates, as in Lemma 5, was merely required for the proof strategy, for numerical evaluation it is not necessary.
As a dual candidate we could useû andv as constructed in (4.16,4.17). It is not hard to see that upon perfect convergence of the domain decomposition algorithm, the functions u ( ) on two overlapping composite cells J A ∈ J A , J B ∈ J B would only differ by a constant factor on their overlap J A ∩ J B = ∅. Consequently, this 'glued' together (û,v) would be a dual maximizer (cf. proof strategy of Proposition 4). In practice, before perfect convergence, we find however that this construction yields unnecessarily bad dual estimates. We now sketch a slightly more sophisticated construction (along with an explanation why the above construction does not work well numerically).
Inspired by the partition graph, Definition 7, and the construction (4.16,4.17) we now define another weighted graph. The vertex set is given by the composite cells J A . Between two vertices J 1 , J 2 ∈ J A there will be an edge if there exists some J B ∈ J B such that J 1 ∩ J B = ∅ and J 2 ∩ J B = ∅. The corresponding edge-weight will be set to Note that the weight depends on the orientation of the edge and that there may be multiple edges between the same two vertices. Similar to (4.16,4.17) we could now fix a root cell J 0 ∈ J A and for any J ∈ J A we fix some path (J 0 , J 1 , . . . , J n = J) and set Again, after perfect convergence, theû obtained by gluing theû J would be a dual maximal X-scaling. The value ofû J would not depend on the choice of the path from J 0 to J and when going around a cycle in the graph, the product of the corresponding weight factors will be equal to 1. This means that log q (J 1 ,J 2 ) is a discrete gradient on the graph, i.e. summing log q (J 1 ,J 2 ) around a cycle yields zero. But before perfect convergence, the weights are inconsistent in the sense that log q (J 1 ,J 2 ) is not a gradient and by picking some arbitrary path from J 0 to J we distribute this inconsistency in an arbitrary way overû J . In practice this leads to poor candidatesû, in particular far from the root cell.
We weaken this effect by doing a discrete Helmholtz decomposition of log q (J 1 ,J 2 ) . More concretely, we look for a minimizer of The gradient of the minimizing potential V will be an optimal approximation of the edge weights in a mean-square sense. This problem can be solved quickly via a linear system, also when there are multiple edges between two vertices. We then setû J := exp(V J ) · u A,J . We observe that this distributes the inconsistencies in the weights more evenly over the graph and yields better dual estimates.
Since we truncate the partial Y -marginals (see Section 6.1 (ii)), unlike in the theoretical analysis the factors v J are not defined on all of Y . But once the factors V are available, the v J can be 'glued' together similar toû.
In practice this method yields small primal-dual gaps (cf. Section 7.3), confirming that the entropic domain decomposition works well.

Multi-scale scheme and ε-scaling
In [23] it is described in detail how a combination of a multi-scale scheme and ε-scaling increases the computational efficiency of the Sinkhorn algorithm and how to implement them numerically. Similarly, the two techniques are crucial for computational efficiency of the domain decomposition scheme. For instance, to initialize Algorithms 1 and 2 one requires a feasible coupling π 0 ∈ Π(µ, ν) (or its partial Y -marginals on basic cells). Without further prior knowledge, the only canonical candidate would be the product measure µ ⊗ ν. But this is almost certainly far from the optimal coupling, hence requiring many iterations, and it is dense, i.e. requiring a lot of memory in the initial iterations. Instead, as usual, we start by solving the problem on a coarse scale and then refine the (approximate) solutions to subsequent finer scales for initialization.
We now provide a few more details on how this is done. Assume X and Y are discrete, finite sets. LetX andŶ be coarse approximations of X and Y , let pa : X Y →X Ŷ be a function that assigns to every element of X and Y their parent element inX andŶ . Coarse approximations of µ and ν are then given byμ := pa µ andν := pa ν where pa denotes the push-forward of measures. Let {X i } i∈Î be a basic partition of (X,μ) and let {X i } i∈I be a basic partition of (X, µ). We choose them in such a way that the fine basic partition can be interpreted as a refinement of the coarse basic partition, i.e. for each i ∈ I there is some unique j ∈Î such that pa(x) ∈X j for all x ∈ X i . We denote the function I →Î, taking i to j by Pa.
Let now (ν j ) j∈Î be the tuple of partial marginals as returned by Algorithm 2 on the coarse scale. We use as initialization on the subsequent finer scale the measures defined via where by a slight abuse of notation, we write ν(y) to mean ν({y}) for singletons.
Proof. We need to show that i∈I ν 0 i = ν and ν 0 i = µ i for all i ∈ I. For the former, for y ∈ Y we find i∈I ν 0 i (y) = j∈Î i∈I: j∈Îν j (pa(y)) ν(pa(y)) i∈I: where in the third expression we first see that the third factor equals 1, since pa −1 (X j ) = ∪ i∈I:Pa(i)=j X i , and subsequently the second factor equals 1, since j∈Îν j =ν, which is implied by the domain decomposition algorithm on the coarse scale. Similarly, for the latter, for i ∈ I, where we have used the consistency condition ν j = μ j =μ(X j ) for j ∈Î in the coarse domain decomposition algorithm.
Of course this scheme can be repeated over multiple successive levels of approximations.
In addition, we refine the scaling factors u from the coarse to the fine level to obtain good initializations. For this, we first 'glue' together the final u at the coarse level, as described in Section 6.3. Then we do a linear interpolation onto the fine points (which we perform in log-space).
Finally, we remark that when changing the parameter ε during ε-scaling from ε old to ε new , the scalings (u, v) should not be left constant, but instead (ε · log(u), ε · log(v)) should be kept constant, so we set u new := exp( ε old εnew · log u old ), and similarly for v new , cf. [23, Section 3.2].

Numerical geometric large scale examples
We now demonstrate the efficiency of Algorithm 2 on large-scale geometric problems and show that it compares favorably to the single Sinkhorn algorithm.

Preliminaries
Hardware and implementation. All numerical experiments were performed on a standard desktop computer with an Intel Core i7-8700 CPU with 6 physical cores and 32 GB of RAM, running a standard Ubuntu-based distribution. We developed an experimental implementation for Python 3.7. Parallelization was implemented with MPI, via the module mpi4py. 1 We used a simple master/worker setup with one thread (master) keeping track of the overall problem and submitting the sub-tasks to the other threads (workers). Parallelization was used for solving of the composite cell problems, the BalanceMeasures function and data refinement from coarse to fine layers.
Stopping criterion and measure truncation. We set the stopping criterion for the internal Sinkhorn solver to Err = 10 −4 (see Section 6.2). This bounds the L 1 -marginal error of the primal iterates. Partial marginals (ν i ) i∈I are truncated at 10 −15 and stored as sparse vectors.
Test data. We focus on solving the Wasserstein-2 optimal transport problem, i.e. we set c(x, y) = x − y 2 but the scheme can be applied to arbitrary costs and we expect efficient performance for any cost of the form c(x, y) = h(x − y) for strictly convex h, such that (in the continuous, unregularized setting) the unique optimal plan is concentrated on the graph of a Monge map, see [11]. As test data we use 2D images with dimensions 2 n × 2 n for n = 6 to n = 11, i.e. from 64 × 64 up to 2048 × 2048, the number of pixels per image ranging from 4.1E3 to 4.2E6. Similar to [22,23] the images were randomly generated, consisting of mixtures of Gaussians with random variances and magnitudes. This represents challenging problem data, leading to non-trivial optimal couplings, involving strong compression, expansion and anisotropies, as visualized in Figures 8 and 11. For each problem size we generated 10 test images, i.e. 45 pairwise non-trivial transport problems, to get some estimate of the average performance.
Multi-scale and ε-scaling. Each image is then represented by a hierarchy of increasingly finer images of size 2 l × 2 l where l ranges from 3 to n. We refer to l as the layer. This induces a simple quad-tree structure over the images where each pixel at layer l is the parent of four pixels at layer l + 1, see Section 6.4 and [22,23] for more details.
We embed the pixels into R 2 as a Cartesian grid of edge length ∆x n := 1 between two neighbouring pixels at layer l = n and with edge length ∆x l := 2 n−l on coarser layers. At each layer l we start solving with ε = 2 · ∆x 2 l , for which we apply four iterations (two on J A and two on J B ). Then we decrease ε by a factor of two and apply two more iterations (one on J A and one on J B ). We repeat this (so that ε = 0.5 · ∆x 2 l ) and then refine the image. This leaves the scale of c(·, ·)/ε approximately invariant throughout the layers (cf. [23]). At the finest layer, we perform two additional iterations at ε = 0.25, which implies that the influence of entropic regularization in our final solution is rather weak (cf. Figure 10). This means, that for two images of size 2 n × 2 n we only perform (n − 2) · 8 + 2 iterations of the domain decomposition algorithm, even though there are 2 2n−4 basic cells in the X-image (see below) and the graph diameter (Definition 7) is on the order O(2 n−2 ). We observe that this logarithmic number of iterations is fully sufficient (cf. Table 1). This is only possible due to the geometric problem structure (Section 5) and the multi-scale scheme with ε scaling.
Basic and composite partitions. As basic partition we divide each image (at a given layer l) into blocks of s × s pixels where the cell size s is a suitable divisor of 2 l . The composite partition J A is then generated by summarizing the basic cells into 2 × 2 blocks, the partition J B is generated in the same way but with an offset of one basic cell in each direction, i.e. with single basic cells in the corners and 2 × 1 blocks along the image boundaries. This is visualized in Figure 2. Each basic cell at layer l is therefore split into four basic cells at layer l + 1, thus satisfying the assumption from Section 6.4.
The cell size s is an important parameter. For small s the composite cell problems are small (and thus easier), but one must solve a larger number of composite problems, implying more communication overhead in parallelization and requiring more domain decomposition iterations. For large s the number of composite problems is smaller, thus inducing less communication overhead and requiring less iterations, but solving the composite problems becomes more challenging. We found that s = 4 yielded the best results for our code. Studying the influence of this parameter in more detail and reducing the computational overhead in the implementation are a relevant direction for future work.

Visualization of primal iterates
To get an impression of the algorithm's behaviour we visualize the evolution of the primal iterates π ( ) . This is difficult since in our examples the iterates are measures on R 4 . But the partition structure provides some assistance. We assign colors to the basic cells in an alternating pattern (e.g. a checkerboard tiling). Denote by col i the color assigned to basic cell i ∈ I (col i could just be a RGB vector). We can then approximately visualize a coupling π ∈ Π(µ, ν) with partial marginals ν i := P Y π i via the color image on Y generated by i∈I col i · dν i dν . When π is a deterministic coupling, e.g. introduced by a Monge map T : X → Y , then region T (X i ) will be colored with col i , i.e. the image would look like a Cartesian grid that was morphed by the map T . When π is non-deterministic, different ν i can overlap and in these regions the resulting color will be a linear interpolation of the colors col i , weighted by the ν i .
Consider the optimal transport problem between the two images from Figure Figure 9 shows the effect of the first two iterations on the finest layer l = 8. The initial image is essentially uniform, since the initial marginals ν 0 i , as generated with (6.1) on basic cells i ∈ I that were generated by sub-dividing the same coarse basic cell j ∈Î are, up to a constant factor, identical. After the first iteration (on J A ) a grid pattern reemerges, after the second iteration (on J B ), it is hard to spot further changes (cf. Figure 10, where for ε = 2 the iterate after four iterations is shown).
The effect of ε-scaling within a given layer is visualized in Figure 10. For ε = 2 the checkerboard pattern is blurred between neighbouring cells. For ε = 0.25, neighbouring cells overlap by at most one pixel, which would also be true in the discrete unregularized setting (recall that our pixels do not all carry the same mass, so the discrete unregularized optimal coupling is usually not deterministic). This demonstrates that the algorithm can be run until very low regularization parameters.
Finally, Figure 11 demonstrates the evolution of the multi-scale scheme. It shows the final coupling on each layer for the smallest ε used on that layer, just before refinement (except for the finest layer). One can trace how increasingly finer structures of the optimal coupling emerge as the resolution improves.

Quantitative evaluation
Having convinced ourselves by means of visualizations that the entropic domain decomposition algorithm seems to work as intended, we now turn to a more quantitative analysis of its performance. A comparison to the performance of a single sparse multi-scale Sinkhorn algorithm [23] is provided in Section 7.4.

Runtime.
We measure the runtime of the algorithm on different problem sizes, with the number of worker threads ranging from 1 to 5. For simplicity, for a single worker thread, the implementation still uses two threads (one master, one worker), but the performance cannot be distinguished from a true single-thread implementation, since the master thread is essentially idle in this case. In addition we are interested in the time required for solving the final (i.e. finest) layer, and for the times required by the main sub-routines, i.e. the Sinkhorn solver, the measure balancing, truncation and refinement. These are listed in Table 1 and the total runtimes are visualized in Figure 12.
We find that the runtime scales approximately linearly (or at most slightly super-linear) Table 1: Summary of average performance (with standard deviation) of the domain decomposition algorithm and a single sparse Sinkhorn algorithm [23] for comparison. For each image size results are averaged over 45 problem instances (Section 7.1). The relative PD gap and the relative dual score are defined in (7.1) and (7.2).
with the size of the marginals which is crucial for solving large problems. The time required for solving the final layer comprises approximately 80% of the total time, indicating that even with a good initialization, solving the finest layer is a challenging problem. The time spent on the Sinkhorn sub-solver for the composite cell problems accounts for approximately 96% of the runtime for large problems, which means that other functions cause relatively little overhead. On small problems these ratios are a bit lower, since the overhead of the multi-scale scheme is more significant. For large problems the runtime is approximately proportional to the inverse number of workers (up to five workers), indicating that parallelization is computationally efficient. Again, for small problems, computational overhead becomes more significant, such that the benefit of parallelization saturates earlier. For large problems we estimate that MPI communication accounts for approximately 3% of time in a domain decomposition iteration.
Sparsity. In addition to runtime the memory footprint of the algorithm is an important criterion for practicability. It is dominated by storing the list (ν i ) i∈I (after truncation at 10 −15 , see Section 6.1), since only a few small sub-problems are solved simultaneously at any given time. We are therefore interested in the number of non-zero entries in this list throughout the algorithm. This number is usually maximal on the finest layer (most pixels) and for the highest value of ε on that layer (with strongest entropic blur), which therefore constitutes the memory bottleneck of the algorithm.
This maximal number and the final number of entries are listed in Table 1 and visualized in Figure 13. We find that both scale approximately linear with marginal size. Again, this is   Solution quality. The data reported in Table 1 also confirms that the algorithm provides high-quality approximate solutions of the problem. The relative primal-dual (PD) gap is defined as (see (2.9) for the definition of J) i.e. we divide the primal-dual gap by the primal score (where, for simplicity we subtract the constant term K ). The dual variables (u, v) are obtained from the composite duals by gluing, as discussed in Section 6.3. We find that this is on the order of 10 −4 , i.e. a high relative accuracy. The L 1 -error of the X-marginal of the primal iterate is consistently even smaller than the prescribed bound Err = 10 −4 (cf. Section 6.2). This happens since we do not check the stopping criterion for the Sinkhorn sub-solver after every iteration. The L 1 -error of the Y -marginal is not (numerically) zero, since a small error accumulates over time during the several additions and subtractions of the partial marginals and in the BalanceMeasures function. But since it is still substantially below the X-error we do not consider this to be a problem. Additional stabilization-routines could be added to the algorithm, should it become necessary.

Comparison to single Sinkhorn algorithm
As a reference we compare the performance of the domain decomposition algorithm to that of a single Sinkhorn algorithm (with adaptive kernel truncation, coarse-to-fine scheme and ε-scaling as in [23]). The results are also summarized in Table 1, and visualized in Figures 12 and 13. The single Sinkhorn algorithm was limited to sizes up to 512 × 512 due to numerical stability and memory issues (see below for details). Both algorithms have several parameters that influence the trade-off between accuracy and runtime. One important parameter is the stopping criterion for the sub-solver Sinkhorn algorithm in the former, and for the global single Sinkhorn algorithm in the latter. Another one is the truncation threshold for partial marginals (cf. Section 6.1) in the former, and for the (stabilized) kernel matrix in the latter [23,Section 3.3].
Stopping criterion. The stopping criterion for the domain decomposition solver was picked such that the global L 1 -error of the marginals is below Err = 10 −4 (cf. Sections 6.2 and 7.1). We observe that if we pick the same criterion for the single Sinkhorn solver, it takes much longer than the domain decomposition method (even without parallelization) and the runtime scales significantly super-linearly with marginal size.
A more detailed analysis reveals that the single Sinkhorn algorithm spends a major part of time on the largest ε on the finest layer, i.e. immediately after the last refinement. We conjecture that this is due to the fact that the single Sinkhorn algorithm represents the primal iterates only implicitly through the dual iterates (u, v) via (u⊗v)·K (cf. Section 2.2). During layer-refinement only the dual variables are refined explicitly. This can introduce quite non-local marginal errors in the primal iterate (i.e. mass must be moved far to correct them) that take many iterations to correct. We tried various heuristics to reduce this effect (e.g. more sophisticated interpolation of the dual variables during refinement) but unfortunately without success.
In contrast, the domain decomposition method keeps explicitly track of primal and dual (partial) iterates, both of which are refined separately during layer transitions. In particular the total mass of the refined primal iterate within each refined basic cell is correct after refinement (cf. (6.1) and Proposition 5) and thus fewer global mass movements are required for correcting these errors.
Therefore, as in [23], numerically we use the L ∞ error as stopping criterion for the single Sinkhorn algorithm. We set the stopping accuracy to 10 −7 , such that (without parallelization) both algorithms have approximately the same runtime.
Sparsity. To perform its iterations the single Sinkhorn algorithm must keep track of a truncated (stabilized) kernel matrix, i.e. one column of non-zero entries for each pixel of the first marginal. Following [23, Section 3.3, Equation (3.8)] we keep entries of the (stabilized) kernel matrix where (u ⊗ v) · k ≥ θ with the choice θ = 10 −10 .
The domain decomposition method (in the form of Algorithm 2) only needs to store a column of non-zero entries for each basic cell (of size s × s) in the first marginal. Thus, to represent the primal iterate with comparable accuracy, the number of non-zero entries required in the domain decomposition method is reduced by a factor s −2 .
In this sense, for the parameter choices in our experiments, the domain decomposition algorithm represents the primal iterates slightly more accurately. At the same time, the number of entries to store the final iterate is reduced to approximately 15% compared to the single Sinkhorn algorithm. The maximum number of entries is even reduced to approximately 11%. On problem size 2048 × 2048 the single Sinkhorn algorithm ran out of memory on our test machine.
Solution quality. As discussed above, to obtain similar runtimes for both algorithms, a less strict stopping criterion had to be chosen for the single Sinkhorn algorithm. Therefore, it produces higher L 1 -marginal errors for the X-marginal that increase with problem size. The Y -marginal error is (numerically) zero, since the algorithm was terminated after a Y -iteration (cf. Remark 1). The relative dual score reported in Table 1 is defined by cf. (7.1), where (u domdec , v domdec ) refers to the (glued) final dual variables of the domain decomposition method (Section 6.3) and (u single , v single ) to the final dual iterates of the single Sinkhorn algorithm. The fact that this number is negative indicates that the domain decomposition provides better dual candidates than the single Sinkhorn.
Numerical stability. For small regularization the standard Sinkhorn algorithm becomes numerically unstable due to the finite precision of discretized numbers on computers. This can be remedied to a large extent by the absorption technique [23]. But refinement between layers remains a delicate point since the interpolated dual variables from the coarser layer may not provide a perfect initial guess at the finer layer. This can lead to numerically zero or empty columns and rows in the (stabilized) kernel matrix and thus will lead to invalid entries in the scalings (u, v). While it is possible to fix these cases, on large problems this is relatively cumbersome and computationally inefficient. We therefore decided to abort the algorithm in these cases instead. In our experiments this concerned 3 of the 45 examples of size 512 × 512 and the majority of the 1024 × 1024 examples (which we have therefore removed from the comparison). These issues may also arise in the domain decomposition algorithm but there they are localized to small problems on composite cells which are easy to fix (e.g. by temporarily increasing ε). We have implemented a corresponding safe-guard. In all our examples it was invoked on two 1024 × 1024 instances.
Summary. Clearly the domain decomposition method is superior to the single Sinkhorn algorithm in various aspects. In comparison, the domain decomposition method (without parallelization) produces more accurate primal and dual iterates in comparable time, with approximately 11% of the memory. It runs more reliably on large problems, since numerical hickups of the Sinkhorn algorithm can be fixed locally. On top of that the domain decomposition method allows for real non-local parallelization, whereas the Sinkhorn algorithm can only be parallelized very locally over the matrix-vector multiplications.

Conclusion and outlook
In this article we studied Benamou's domain decomposition algorithm for optimal transport in the entropy regularized setting. The key observation is that the regularized version converges to the unique globally optimal solution under very mild assumptions. We proved linear convergence of the algorithm w.r.t the KL-divergence and illustrated the (potentially very slow) rates with numerical examples.
We then argued that on 'geometric problems' the algorithm should converge much faster. To confirm this experimentally, we discussed a practical, efficient version of the algorithm with reduced memory footprint, numerical safeguards for approximation errors, primal-dual certificates and a coarse-to-fine scheme. With this we were able to compute optimal transport between images with ≈ 4 megapixels on a standard desktop computer, matching two megapixel images in approximately four minutes. Even without parallelization, the algorithm compared favourably to a single Sinkhorn algorithm in terms of runtime, memory, solution quality and numerical reliability. With parallelization the runtime was efficiently reduced with little communication overhead.
A practical open question for future research is the development of more sophisticated and efficient implementations, e.g. using GPUs and/or multiple machines, with more sophisticated parallelization structure (e.g. no single unit needs to keep track of the full problem). This should then be tested on 3D problems and with more general cost functions. In the course of this, parameters such as the basic cell size should be studied in more detail.
On the theoretical side the convergence analysis on 'geometric problems' is an interesting challenge, to provide a more thorough understanding for the low number of required iterations. We conjecture that a 'discrete Helmholtz decomposition' as discussed in Section 6.3 may become relevant for this. The relation between optimal transport and the Helmholtz decomposition was already noted in [4] and exploited numerically in [12].

A Appendix
Proof of Proposition 1.
Proof of Proposition 2.
(i). Since u ∈ L ∞ + (X, µ), log u is also µ-essentially bounded from above, thus the first integral is either finite or −∞. The same holds for the second term. The third and fourth terms of J are always finite. Hence, the value of J is unambiguously defined and contained in [−∞, ∞).
Here we used that J(u, v) > −∞ implies u > 0μ-a.e. and v > 0ν-a.e., so π ∈ Π(μ,ν) has no mass in the region of the second integral (where u ⊗ v = 0). The integrand in the first integral is non-negative by the Fenchel-Young inequality and we added the second (vanishing) integral to emphasize that all contributions were considered.