A note on overrelaxation in the Sinkhorn algorithm

We derive an a priori parameter range for overrelaxation of the Sinkhorn algorithm that guarantees global convergence and strictly faster asymptotic local convergence. Guided by the spectral analysis of the linearized problem, we pursue a zero-cost procedure for choosing a near-optimal relaxation parameter.


Introduction and statement of result
The Sinkhorn algorithm is the benchmark approach to the fast computation of the entropic regularization of optimal transport [4]. Ultimately, one is faced with the following numerical problem: given two probability vectors a ∈ R^m_+, b ∈ R^n_+ and a matrix K ∈ R^{m×n}_+, the goal is to find a pair of vectors (u, v) ∈ R^m_+ × R^n_+ such that

    u • (Kv) = a,    v • (K^T u) = b,    (1)

where x • y denotes the componentwise multiplication (Hadamard product) of vectors of equal dimension. Here R_+ refers to the positive reals. We assume min(m, n) ≥ 2.
In the standard Sinkhorn algorithm an approximating sequence (u^ℓ, v^ℓ) starting from an initial vector v^0 ∈ R^n_+ is constructed via the update rule

    u^{ℓ+1} = a ⊘ (K v^ℓ),    v^{ℓ+1} = b ⊘ (K^T u^{ℓ+1}),

where x ⊘ y denotes the componentwise division of vectors of equal dimension. It is a classic result by Sinkhorn [16] that for any initial point v^0 ∈ R^n_+ the algorithm converges to a solution (u*, v*) of (1), which is unique modulo rescaling (tu*, t^{−1}v*), t > 0. Moreover, the convergence, e.g. of suitably normalized iterates u^ℓ/‖u^ℓ‖ and v^ℓ/‖v^ℓ‖, or with respect to other equivalent distance measures like the Hilbert metric, is R-linear with an asymptotic rate at least Λ(K)², where Λ(K) < 1 is the Birkhoff contraction ratio defined in (8) further below [8]. See also [15] for an overview.
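As a concrete illustration, the update rule above can be sketched in a few lines of Python (a minimal sketch; the function name and the tiny test problem are our own):

```python
import numpy as np

def sinkhorn(K, a, b, v0, iters=1000):
    """Standard Sinkhorn iteration: u <- a / (K v), v <- b / (K^T u),
    with all divisions understood componentwise (a sketch)."""
    v = v0.copy()
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u, v

# small sanity check on a 2x2 positive matrix
K = np.array([[1.0, 2.0], [3.0, 4.0]])
a = np.array([0.5, 0.5])
b = np.array([0.3, 0.7])
u, v = sinkhorn(K, a, b, np.ones(2))
```

At the limit the pair (u, v) satisfies the scaling equations (1), i.e. u • (Kv) = a and v • (K^T u) = b, up to rounding.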
In this note we discuss a modified version of the Sinkhorn algorithm employing relaxation, which was recently proposed in [18] and [14]. It uses the update rule

    u^{ℓ+1} = (u^ℓ)^{1−ω} • (a ⊘ (K v^ℓ))^ω,    v^{ℓ+1} = (v^ℓ)^{1−ω} • (b ⊘ (K^T u^{ℓ+1}))^ω,    (2)

where ω > 0 is a suitably chosen relaxation parameter, and exponentiation is understood componentwise. In a log-domain formulation such as (7) further below, the relation to the classic concept of relaxation in (nonlinear) fixed point iterations becomes immediately apparent. Note that the iteration (2) still has the solution of (1) as its unique (modulo scaling) fixed point. As illustrated in [18] and [14], choosing the parameter ω larger than one can significantly accelerate the convergence compared to the standard Sinkhorn method, which sometimes can be slow. For optimal transport, such an improvement could be in particular relevant in the regime of small regularization, or when a high target precision is needed, such as in applications in density functional theory [3]. While global convergence for ω ≠ 1 is not obvious anymore, local convergence of the modified method is ensured for all 0 < ω < 2, and the asymptotically optimal relaxation parameter can be determined from its linearization at a fixed point (u*, v*). In logarithmic coordinates, the linearization of the standard Sinkhorn method has the iteration matrix

    M = diag(1/a) P* diag(1/b) (P*)^T,    where P* = diag(u*) K diag(v*).    (3)

The local convergence rate equals the second largest eigenvalue 0 ≤ ϑ_2 < 1 of that matrix; see [11]. Note that M has real and nonnegative eigenvalues since it is similar to a positive semidefinite matrix, and its largest eigenvalue equals one (the eigenvector having constant entries), which accounts for the scaling indeterminacy in the problem formulation. For the modified method with relaxation, the local rate is also related to ϑ_2, which has been worked out in [18] and is summarized in the following theorem. For convenience, we provide a brief outline of how this result can be obtained at the end of Section 2.
Theorem 1 (cf. [18]). Assume ϑ_2 > 0. For all choices of 0 < ω < 2 the modified Sinkhorn algorithm (2) is locally convergent in some neighborhood of (u*, v*). Its asymptotic (R-linear) convergence rate is

    ρ_ϑ(ω) = (1/4) ( ωϑ + √(ω²ϑ² − 4(ω − 1)) )²   for 0 < ω ≤ ω_opt,
    ρ_ϑ(ω) = ω − 1                                 for ω_opt ≤ ω < 2,    (4)

where ϑ := (ϑ_2)^{1/2} and

    ω_opt := 2 / (1 + √(1 − ϑ_2)).    (5)

It holds that ρ_ϑ(ω) < 1 for all 0 < ω < 2, and ω_opt provides the minimal possible rate (independent of the starting point) on that interval, namely ρ_ϑ(ω_opt) = ω_opt − 1.

By the above theorem, the optimal relaxation parameter ω_opt is always larger than one (if ϑ_2 > 0). In fact, by the exact formula (4) for the convergence rate, the range of ω for which the modified method is asymptotically strictly faster than the standard Sinkhorn method, that is, ρ_ϑ(ω) < ϑ_2 = ρ_ϑ(1), is precisely the interval

    1 < ω < 1 + ϑ_2.    (6)

However, the value of ϑ_2 depends on the solution and is therefore not known in advance. To deal with this problem, an adaptive procedure for choosing ω is proposed in [18]. As our contribution, the main goal in this note is to provide an a priori interval for the relaxation parameter ω for which the modified iteration is both globally convergent and locally faster than the standard Sinkhorn method. In Theorem 3 we first prove global convergence of the modified method for parameters in the interval 0 < ω < 2/(1 + Λ(K)). In Theorem 4 we then provide an a priori lower bound ϑ_2 ≥ δ_{K,a,b} > 0, which depends only on the data of the problem, but requires a full rank assumption on K. By (6), any ω ∈ (1, 1 + δ_{K,a,b}) then satisfies ρ_ϑ(ω) < ϑ_2. Taken together this yields the following result.
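The rate function and the optimal parameter come from the classical theory of block SOR iterations; they are easy to evaluate numerically (a sketch with function names of our own choosing):

```python
import numpy as np

def sor_rate(omega, theta2):
    """Asymptotic rate rho_theta(omega) of the relaxed method, where
    theta2 is the second largest eigenvalue of M and theta = sqrt(theta2)
    (classical block SOR formulas, cf. Young)."""
    theta = np.sqrt(theta2)
    omega_opt = 2.0 / (1.0 + np.sqrt(1.0 - theta2))
    if omega >= omega_opt:
        return omega - 1.0
    return 0.25 * (omega * theta + np.sqrt(omega**2 * theta2 - 4.0 * (omega - 1.0))) ** 2

theta2 = 0.81                                   # an illustrative value
omega_opt = 2.0 / (1.0 + np.sqrt(1.0 - theta2))
```

In particular, sor_rate(1.0, theta2) recovers the standard Sinkhorn rate ϑ_2, the minimum over (0, 2) is attained at omega_opt with value omega_opt − 1, and any ω in (1, 1 + ϑ_2) yields a strictly smaller rate than ϑ_2.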
Theorem 2. Assume rank(K) = min(m, n) ≥ 2. For any 1 < ω < 1 + δ_{K,a,b} the asymptotic local convergence rate of the modified Sinkhorn method (2) is faster than for the standard Sinkhorn method. For

    1 < ω < min( 2/(1 + Λ(K)), 1 + δ_{K,a,b} )

the modified method is both globally convergent and asymptotically faster than the standard method.
We remark that our derived a priori interval for ω is usually very small, and hence our result is of rather theoretical interest. In the relevant cases, when ϑ_2 is close to one, significant acceleration is achieved only when ω is close to ω_opt (which tends to two as ϑ_2 → 1). A possible heuristic to select a nearly optimal relaxation is to approximate the second largest eigenvalue of M based on the current iterate. After a similarity transform, this requires computing the spectral norm of a symmetric matrix. An even simpler approach, as suggested in [18], is to directly estimate ϑ_2, and hence ω_opt, by monitoring the convergence rate of the standard Sinkhorn method in terms of a suitable residual. In the final Section 4 we include numerical illustrations, which indicate that in certain cases such heuristics can be quite precise already in the initial phase of the algorithm, resulting in an almost optimal convergence rate at almost no additional cost. This confirms that overrelaxation is a simple way to significantly accelerate the Sinkhorn method in cases where it is slow. For completeness, we should mention that alternative approaches for solving problem (1) and aiming at fast convergence have been proposed based on Newton's method; see, e.g., [12,2] and references therein.
The convergence analysis of the Sinkhorn method is usually carried out in a log-domain formulation [15]. We choose the closely related framework of compositional data space used, e.g., in statistics [13], which we think could be of independent interest in this context. In this space, which is introduced in the next section, the Sinkhorn algorithm with a positive matrix K reads as a nonlinear fixed point iteration for an essentially contractive iteration function, as is known from the Birkhoff–Hopf theorem. The main results are then presented in Section 3. Let us note that the assumption that K has strictly positive entries is not essential for all of the results. While global convergence of the standard Sinkhorn method to a unique (up to scaling) positive solution (u, v) of (1) can be shown under several weaker assumptions, most notably when a = b = 1 and K is square, nonnegative and has total support [17], we require the global contractivity of the process in the Hilbert metric (which holds for positive K) in our proof that global convergence can still be ensured for some ω > 1 (Theorem 3). The idea of accelerating convergence by overrelaxation, on the other hand, is very general, and the local spectral analysis provided by Theorem 1 applies whenever the iteration (2) is locally well defined around a (positive) fixed point (u*, v*) and ϑ_2 < 1. Correspondingly, Theorem 4 on a lower bound for ϑ_2 does not require K to be positive. Hence one has guaranteed acceleration of local convergence for 1 < ω < 1 + δ_{K,a,b} in several scenarios where K is only nonnegative.

Formulation in compositional data space
The problem (1) as well as the Sinkhorn algorithm and its modified variant inherit a natural scaling indeterminacy of the variables u and v. They can therefore be formulated in a suitable equivalence space. Here we recast the algorithm in the framework of what is called compositional data space; see, e.g., [13,1]. To this aim, let

    C^m := R^m_+ / ∼,    where x ∼ y if and only if x = ty for some t > 0.

The resulting equivalence class of x will be denoted by [x]. One specifies a vector addition and a scalar multiplication on C^m via

    [x] + [y] := [x • y],    γ • [x] := [x^γ],    γ ∈ R,

where x^γ has the components (x_1^γ, ..., x_m^γ). As a result (C^m, +, •) becomes a real vector space of dimension m − 1. In this space we consider the so-called Hilbert norm

    ‖[x]‖_H := log( max_i x_i / min_i x_i ).

Note that this norm on the equivalence classes coincides with the well-known Hilbert distance on the representatives:

    ‖[x] − [y]‖_H = d_H(x, y) := log max_{i,j} (x_i y_j) / (x_j y_i).

Similarly we construct a Banach space C^n = R^n_+ / ∼. The modified Sinkhorn algorithm (2) can be interpreted as an iteration in the space C^m × C^n and reads

    [u^{ℓ+1}] = (1 − ω) • [u^ℓ] + ω • ([a] − K([v^ℓ])),    [v^{ℓ+1}] = (1 − ω) • [v^ℓ] + ω • ([b] − K^T([u^{ℓ+1}])),    (7)

where K : C^n → C^m and K^T : C^m → C^n are now the nonlinear maps given by

    K([v]) := [Kv],    K^T([u]) := [K^T u].

The convergence of the standard Sinkhorn algorithm (ω = 1) is based on a famous result of Birkhoff and Hopf on the contractivity of K and K^T. To state it, define the quantities

    η(K) := max_{i,j,k,l} (K_{ik} K_{jl}) / (K_{il} K_{jk}),    Λ(K) := tanh( (1/4) log η(K) ).    (8)

Then the following holds; for a proof, see, e.g., [5, Theorems 3.5 & 6.2].
Theorem (Birkhoff–Hopf). For any x, y ∈ R^n_+ it holds that d_H(Kx, Ky) ≤ Λ(K) d_H(x, y).

Note that Λ(K) = Λ(K^T) < 1. As a result, both K and K^T are contractive maps in the Hilbert norm with Lipschitz constant Λ(K), which is also called the Birkhoff contraction ratio of K. Based on this, it is not difficult to establish the global convergence of the standard Sinkhorn algorithm in the space C^m × C^n at a rate O(Λ(K)²).
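The contraction property is easy to check numerically. The following sketch (function names ours) computes d_H and Λ(K) directly from their definitions and verifies the bound on a random positive matrix:

```python
import numpy as np

def hilbert_dist(x, y):
    # d_H(x, y) = log max_{i,j} (x_i y_j) / (x_j y_i)
    r = np.log(x) - np.log(y)
    return float(r.max() - r.min())

def birkhoff_ratio(K):
    # Lambda(K) = tanh((1/4) log eta(K)),
    # eta(K) = max_{i,j,k,l} (K_ik K_jl) / (K_il K_jk)
    L = np.log(K)
    A = L[:, :, None] - L[:, None, :]          # A[i, k, l] = log K_ik - log K_il
    log_eta = float((A.max(axis=0) - A.min(axis=0)).max())
    return float(np.tanh(0.25 * log_eta))

rng = np.random.default_rng(0)
K = rng.uniform(0.5, 2.0, (5, 4))
x = rng.uniform(0.1, 1.0, 4)
y = rng.uniform(0.1, 1.0, 4)
```

For any such positive K, the measured contraction factor d_H(Kx, Ky)/d_H(x, y) never exceeds Λ(K) < 1.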
It is important to emphasize that studying the convergence in C^m × C^n, that is, convergence of equivalence classes, is sufficient for understanding the method in R^m_+ × R^n_+. Indeed, a pair ([u*], [v*]) is a fixed point of (7) if and only if for any choice of representatives (u*, v*) there exist λ, µ such that λ u* • Kv* = a and µ v* • K^T u* = b. From 1_m^T a = 1_n^T b (here 1 denotes a vector of all ones) it follows that λ = µ, and hence, e.g., u_+ := λ^{−1/2} u* and v_+ := λ^{−1/2} v* solve the initial problem (1), where now λ = (u*)^T K v*. Moreover, choosing representatives (u^ℓ, v^ℓ) of the iterates ([u^ℓ], [v^ℓ]) such that 1_m^T u^ℓ = 1_n^T v^ℓ = 1, and setting u_+^ℓ := λ_ℓ^{−1/2} u^ℓ and v_+^ℓ := λ_ℓ^{−1/2} v^ℓ with λ_ℓ = (u^ℓ)^T K v^ℓ, yields a sequence which converges exponentially fast to (u_+, v_+).
We now briefly outline how the local convergence analysis for (7) can be conducted [18], leading to Theorem 1. By combining both steps of the iteration (7) into a nonlinear fixed point iteration (u^{ℓ+1}, v^{ℓ+1}) = F(u^ℓ, v^ℓ) in the space C^m × C^n, one finds that its derivative at the fixed point (u*, v*) takes the form

    M_ω = (I − ωL)^{−1} ( (1 − ω) I + ω U ),    (9)

where L contains the derivative (K^T)'(u*) in its lower left block and U contains K'(v*) in its upper right block (both are zero otherwise). Matrices of the form M_ω are well known as error iteration matrices of block SOR methods for linear systems. The spectral radius of M_ω can be computed exactly from formula (4), if the spectral radius ϑ of L + U is known; see [19, Sec. 6.2] or [9, Thm. 4.27]. The eigenvalues of L + U, however, are square roots of the eigenvalues of the composition of derivatives K'(v*)(K^T)'(u*), which is a linear map on C^m. It remains to show that the largest eigenvalue of that operator is precisely the second largest eigenvalue of the matrix M in (3). Indeed, by elementary calculations, M is the matrix representation of K'(v*)(K^T)'(u*) under the isomorphism u ↦ exp(u) between the subspace {u ∈ R^m : 1_m^T u = 0} ⊆ R^m and C^m.

Main results
We prove the global convergence of the modified method for a range of values ω larger than one.

Theorem 3. Let K be positive and 0 < ω < 2/(1 + Λ(K)). Then for any starting point the modified Sinkhorn iteration (7) converges R-linearly to the unique fixed point ([u*], [v*]) in C^m × C^n.
Proof. Starting from (7), using the triangle inequality and the contractivity of K and K^T provided by the Birkhoff–Hopf theorem, we obtain

    ‖[u^{ℓ+1}] − [u*]‖_H ≤ |1 − ω| ‖[u^ℓ] − [u*]‖_H + ω Λ(K) ‖[v^ℓ] − [v*]‖_H,
    ‖[v^{ℓ+1}] − [v*]‖_H ≤ |1 − ω| ‖[v^ℓ] − [v*]‖_H + ω Λ(K) ‖[u^{ℓ+1}] − [u*]‖_H.

As a consequence, for ∆u^ℓ := ‖[u^ℓ] − [u*]‖_H and ∆v^ℓ := ‖[v^ℓ] − [v*]‖_H we have

    (∆u^{ℓ+1}, ∆v^{ℓ+1})^T ≤ T_ω (∆u^ℓ, ∆v^ℓ)^T,    T_ω := [ |1 − ω|, ωΛ(K) ; ωΛ(K)|1 − ω|, |1 − ω| + ω²Λ(K)² ],

and the vector inequality is understood entry-wise. Since all involved quantities are non-negative the inequality can be iterated, which gives (∆u^{ℓ+1}, ∆v^{ℓ+1})^T ≤ (T_ω)^{ℓ+1} (∆u^0, ∆v^0)^T.
Hence, to prove exponential convergence it suffices to show that the spectral radius of T_ω is strictly less than one. Since the spectral radius equals

    ρ(T_ω) = ( ( ωΛ(K) + √( ω²Λ(K)² + 4|1 − ω| ) ) / 2 )²,

a short calculation shows that ρ(T_ω) < 1 if and only if |1 − ω| < 1 − ωΛ(K), which for ω > 0 is equivalent to ω < 2/(1 + Λ(K)).

Next we provide a lower bound for the second largest eigenvalue ϑ_2 of the matrix M in (3), which by (6) then yields an interval for ω such that the modified method has a strictly faster asymptotic convergence rate than the standard Sinkhorn method.
where σ_min(K) is the smallest positive singular value of K, ‖K‖_∞ = max_{‖v‖_∞ = 1} ‖Kv‖_∞, and the subscripts min and max denote the smallest and largest entry of the corresponding vector. Then it holds that ϑ_2 ≥ δ_{K,a,b}. Note that for a positive matrix δ_{K,a,b} < 1 if m ≥ n, and vice versa if m ≤ n. Hence δ_{K,a,b} is indeed smaller than one, which is in line with the bound ϑ_2 ≤ Λ(K)².
Proof. We consider the case m ≥ n. Instead of the matrix M we consider the positive semidefinite matrix

    H := diag(a^{−1/2}) P* diag(1/b) (P*)^T diag(a^{−1/2}),    P* = diag(u*) K diag(v*),

which is obtained from M by a similarity transformation (and using (1)). Since the dominant eigenvector of H (with eigenvalue one) is a^{1/2}, we have

    ϑ_2 = max_{w ⊥ a^{1/2}, w ≠ 0} (w^T H w) / (w^T w).

By projecting on the orthogonal complement of a^{1/2}, and noting that ‖a^{1/2}‖² = 1, we first rewrite this as

    ϑ_2 = max_w ( w^T H w − (w^T a^{1/2})² ) / ( w^T w − (w^T a^{1/2})² ),

where the maximum is taken over all w that are not collinear to a^{1/2}. For such w the numerator is always nonnegative and the denominator is positive. Next we substitute w = diag(a^{1/2} ⊘ u*) z with a new variable z. This yields

    ϑ_2 = max_z ( Σ_i v*_i (K^T z)_i² / (K^T u*)_i − ((v*)^T K^T z)² ) / ( Σ_i z_i² (K v*)_i / u*_i − ((v*)^T K^T z)² ),

where the maximum is taken over all z not collinear with u* (the numerator is then nonnegative and the denominator is positive). To obtain a lower bound, we now evaluate the expression at z satisfying K^T z = e_j, where e_j denotes the j-th unit vector. Note that such z exists (K^T has full row rank) and is indeed not collinear to u*, since otherwise K^T u* would be collinear with e_j, which contradicts the positivity of K^T u* as n ≥ 2. Therefore, using this z, we get

    ϑ_2 ≥ ( v*_j / (K^T u*)_j − (v*_j)² ) / ( Σ_i z_i² (K v*)_i / u*_i − (v*_j)² ).

We can choose j as the position of a largest entry of the vector v*. Then in the denominator the remaining sum can be bounded using ‖z‖ ≤ 1/σ_min(K). This leads to the asserted lower bound ϑ_2 ≥ δ_1.
When m ≤ n, we can simply interchange the roles of K and K^T, a and b, as well as u* and v* in this proof to obtain ϑ_2 ≥ δ_2.
Taken together, Theorem 3 and Theorem 4 result in Theorem 2.
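As a quick numerical sanity check of the global convergence part of Theorem 2 (a sketch; the random test problem, sizes, and tolerances are our own choices), one can pick ω just inside the a priori interval and verify that the relaxed iteration (2) still balances the marginals:

```python
import numpy as np

rng = np.random.default_rng(0)
m = n = 20
K = rng.uniform(0.5, 1.5, (m, n))          # a positive matrix
a = rng.uniform(size=m); a /= a.sum()
b = rng.uniform(size=n); b /= b.sum()

# Birkhoff contraction ratio Lambda(K) from definition (8)
L = np.log(K)
A = L[:, :, None] - L[:, None, :]
lam = float(np.tanh(0.25 * (A.max(axis=0) - A.min(axis=0)).max()))
omega = 0.99 * 2.0 / (1.0 + lam)           # just inside (0, 2 / (1 + Lambda(K)))

u, v = np.ones(m), np.ones(n)
for _ in range(2000):
    u = u ** (1.0 - omega) * (a / (K @ v)) ** omega
    v = v ** (1.0 - omega) * (b / (K.T @ u)) ** omega

P = u[:, None] * K * v[None, :]
residual = np.abs(P.sum(axis=1) - a).sum() + np.abs(P.sum(axis=0) - b).sum()
```

For this mildly heterogeneous K the ratio Λ(K) is well below one, so the admissible ω is comfortably larger than one while global convergence is still guaranteed.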

Numerical illustration
We illustrate the effect of overrelaxation by two numerical experiments related to optimal transport. The first is motivated by an application to color transfer between images [6]. The matrix K = K_ε is generated as

    K_{ij} = exp( −‖x_i − y_j‖² / ε ),

where x_i, y_j ∈ R³ are RGB values (scaled to [0, 1]) of m = n = 1000 randomly sampled pixels in two different color images, respectively. The vectors a and b are chosen as uniform distributions, i.e. a = 1_m/m and b = 1_n/n. We choose ε = 0.01. In this scenario the standard Sinkhorn method is reasonably fast, but can still be accelerated using overrelaxation. A typical outcome for different relaxation strategies is shown in Fig. 1 (left), where we plot for 500 iterations the ℓ1-distance ‖P^ℓ − P*‖_1 between the matrices P^ℓ = diag(u^ℓ) K diag(v^ℓ) and a numerical reference solution P* = diag(u*) K diag(v*). This error corresponds to the total variation distance of the corresponding transport plans. Even if this quantity (specifically P*) is not available in a practical computation, it is a natural measure for the convergence of the method. Besides the standard Sinkhorn method (ω = 1), we run the method with a fixed relaxation ω = 1.5, and with the 'optimal' relaxation ω_opt, which is computed via formula (5) from the second largest singular value ϑ of the matrix diag(1/a^{1/2}) P* diag(1/b^{1/2}) (then ϑ² is the second largest eigenvalue of (3)). We do not consider relaxation based on the lower bound on ϑ_2 in Theorem 4, since the resulting ω is too close to one. In all variants of the algorithm the same (uniformly) random starting vectors u^0 and v^0 are used. As can be seen, using ω_opt significantly accelerates the convergence. Moreover, although ω_opt only provides the optimal local rate, the positive effect shows quite immediately. However, the value of ω_opt is a priori unknown in practice. Therefore we also tested a simple heuristic, similar to one suggested in [18]. It is known that the convergence of the Sinkhorn method can be monitored, e.g., through the
error ‖P^ℓ 1_n − a‖_1; cf. [15, Remark 4.14]. Therefore, since ϑ_2 equals the asymptotic convergence rate of the standard Sinkhorn method, we may take the observed decay rate θ̂ of this residual as a current approximation for ϑ_2. In the purple curve (diamond markers) in Fig. 1 (left), we updated ω a single time after 20 steps of the standard method based on this quantity, using formula (5). This comes at almost no additional cost, but yields the near optimal rate in this example. Of course such a heuristic could be applied in a more systematic way, e.g., by monitoring the changes of ‖P^ℓ 1_n − a‖_p for a suitable value of p over several iterations. We note that adapting ω in (linear and nonlinear) SOR methods based on currently observed convergence rates is a classical idea and has been proposed, e.g., in [19] or [10].
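The residual-based heuristic can be sketched as follows (a toy setup of our own; the grid, regularization, and window length are illustrative only, and the estimate is clipped defensively in case the residual has already stagnated at machine precision):

```python
import numpy as np

rng = np.random.default_rng(1)
m = n = 50
x = np.linspace(0.0, 1.0, m)
y = np.linspace(0.0, 1.0, n)
eps = 0.02
K = np.exp(-np.abs(x[:, None] - y[None, :]) / eps)   # l1 cost kernel
a = rng.uniform(size=m); a /= a.sum()
b = rng.uniform(size=n); b /= b.sum()

u, v = np.ones(m), np.ones(n)
res = []
for _ in range(200):                                  # standard Sinkhorn steps
    u = a / (K @ v)
    v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]
    res.append(np.abs(P.sum(axis=1) - a).sum())       # residual ||P 1_n - a||_1

# geometric decay rate over the last k iterations approximates theta_2
k = 50
theta2_hat = (res[-1] / res[-1 - k]) ** (1.0 / k)
theta2_hat = min(theta2_hat, 1.0 - 1e-12)             # guard against stagnation
omega_hat = 2.0 / (1.0 + np.sqrt(1.0 - theta2_hat))   # formula (5)
```

The resulting omega_hat would then be used for the subsequent relaxed iterations; updating it once, as in the experiment above, already captures most of the gain.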
As a second example we consider a 1D transport problem between two random measures a and b (generated from a uniform distribution) on an equidistant grid in [0, 1], and with the ℓ1-norm as cost. The matrix K in this case is given as

    K_{ij} = exp( −|x_i − y_j| / ε ).

Again we choose m = n = 1000 and ε = 0.01, and then compare different relaxation strategies, starting from the same random initialization (u^0, v^0). As can be seen in Fig. 1 (right), which shows 500 iterations with different relaxation strategies, this problem seems to be more difficult and the standard Sinkhorn method is extremely slow. A suitable relaxation compensates this and restores fast convergence; however, as illustrated by the slow convergence of the curve for ω = 1.5, the estimation of ω_opt, and hence of ϑ_2, needs to be rather precise. Since here the convergence rate of the standard method stabilizes later, we apply the above heuristic of estimating ϑ_2 only after 200 iterations of the standard iteration, resulting in the purple curve (diamond markers). The oscillatory behavior occurs because ω is estimated larger than ω_opt, in which case the spectral radius ω − 1 of the linearized iteration matrix M_ω in (9) is attained at complex eigenvalues. It is possible in this example to update ω earlier using computationally more expensive heuristics. For instance, the green curve (triangle markers) is obtained by computing, after 50 iterations of the standard method, an approximation of ϑ as the second largest singular value of the matrix diag(1/(a^ℓ)^{1/2}) P^ℓ diag(1/(b^ℓ)^{1/2}), where a^ℓ = u^ℓ • Kv^ℓ and b^ℓ = v^ℓ • K^T u^ℓ. This could be done iteratively; we used the Matlab function svds. This results in an almost optimal convergence rate in this example. Of course, several similar strategies could be devised.
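The singular-value-based estimate can be sketched as follows (a small illustrative setup of our own; we use a dense numpy.linalg.svd instead of an iterative solver like svds, which only matters for large problems):

```python
import numpy as np

rng = np.random.default_rng(2)
m = n = 40
x = np.linspace(0.0, 1.0, m)
y = np.linspace(0.0, 1.0, n)
K = np.exp(-np.abs(x[:, None] - y[None, :]) / 0.05)  # l1 cost kernel
a = rng.uniform(size=m); a /= a.sum()
b = rng.uniform(size=n); b /= b.sum()

# a few standard Sinkhorn steps to reach the linear regime
u, v = np.ones(m), np.ones(n)
for _ in range(50):
    u = a / (K @ v)
    v = b / (K.T @ u)

P = u[:, None] * K * v[None, :]
al = u * (K @ v)             # current row marginals of P
bl = v * (K.T @ u)           # current column marginals of P
Abal = P / np.sqrt(al)[:, None] / np.sqrt(bl)[None, :]
s = np.linalg.svd(Abal, compute_uv=False)
theta2_est = s[1] ** 2       # largest singular value equals one by construction
omega_est = 2.0 / (1.0 + np.sqrt(1.0 - theta2_est))  # formula (5)
```

Note that (al^{1/2}, bl^{1/2}) is always an exact singular pair of the balanced matrix with singular value one, so s[1] is the quantity of interest even before convergence.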


Figure 1: Effect of different relaxation strategies in two examples.