1 Introduction and statement of result

The Sinkhorn algorithm is the benchmark approach to fast computation of the entropic regularization of optimal transportation [1]. Ultimately, one is faced with the following numerical problem: Given two probability vectors \(a \in \mathbb {R}_{+}^m\), \(b \in \mathbb {R}^{n}_{+}\) and a matrix \(K \in \mathbb {R}^{m \times n}_{+}\), the goal is to find a pair of vectors \((u,v) \in \mathbb {R}_{+}^m \times \mathbb {R}_{+}^n\) such that

$$\begin{aligned} u {}\circ {}Kv = a \quad \text{ and } \quad v {}\circ {}K^\mathsf Tu = b, \end{aligned}$$
(1)

where \(x {}\circ {}y\) denotes the componentwise multiplication (Hadamard product) of vectors of equal dimension. Here \(\mathbb {R}_+\) refers to the positive reals. We assume \(\min (m,n) \ge 2\).

In the standard Sinkhorn algorithm an approximating sequence \((u_\ell ,v_\ell )\) starting from an initial vector \(v_0 \in \mathbb {R}_+^n\) is constructed via the update rule

$$\begin{aligned} u_{\ell + 1} = \frac{a}{Kv_{\ell }}, \qquad v_{\ell +1} = \frac{b}{K^\mathsf Tu_{\ell +1}}, \end{aligned}$$

where \(\frac{x}{y}\) denotes the componentwise division of vectors of equal dimension. It is a classic result by Sinkhorn [2] that for any initial point \(v_0 \in \mathbb {R}^n_+\) the algorithm converges to a solution \((u^*,v^*)\) of (1), which is unique modulo rescaling \((t u^*, t^{-1} v^* )\), \(t >0\). Moreover, the convergence, e.g., of suitably normalized iterates \(u_\ell / \Vert u_\ell \Vert \) and \(v_\ell / \Vert v_\ell \Vert \), or measured in other equivalent distance measures like the Hilbert metric, is R-linear with an asymptotic rate no worse than \(\Lambda (K)^2\), where \(\Lambda (K) < 1\) is the Birkhoff contraction ratio defined in (8) further below [3]. See also [4] for an overview.
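
For illustration, the update rule can be written down in a few lines. The following NumPy sketch uses our own (hypothetical) function name and an illustrative stopping criterion; it is not taken from the cited works.

```python
import numpy as np

def sinkhorn(K, a, b, num_iters=1000, tol=1e-12):
    """Standard Sinkhorn iteration for u o (K v) = a and v o (K^T u) = b."""
    u = np.ones(K.shape[0])
    v = np.ones(K.shape[1])          # any positive v_0 works
    for _ in range(num_iters):
        u = a / (K @ v)              # u_{l+1} = a / (K v_l)
        v = b / (K.T @ u)            # v_{l+1} = b / (K^T u_{l+1})
        if np.linalg.norm(u * (K @ v) - a, 1) < tol:   # marginal residual
            break
    return u, v
```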

In this note we discuss a modified version of the Sinkhorn algorithm employing relaxation, which was recently proposed in [5, 6]. It uses the update rule

$$\begin{aligned} u_{\ell + 1} = u_\ell ^{1-\omega } {}\circ {}\left( \frac{a}{ K v_\ell }\right) ^{\omega }, \qquad v_{\ell + 1} = v_\ell ^{1-\omega } {}\circ {}\left( \frac{b}{ K^\mathsf Tu_{\ell + 1} }\right) ^{\omega }, \end{aligned}$$
(2)

where \(\omega > 0 \) is a suitably chosen relaxation parameter, and exponentiation is understood componentwise. In a log-domain formulation such as (7) further below, the relation to the classic concept of relaxation in (nonlinear) fixed point iterations will become immediately apparent. Note that the iteration (2) still has the solution of (1) as its unique (modulo scaling) fixed point. As illustrated in [5, 6], choosing the parameter \(\omega \) larger than one can significantly accelerate the convergence speed compared to the standard Sinkhorn method, which can sometimes be slow. For optimal transport, such an improvement could be particularly relevant in the regime of small regularization, or when a high target precision is needed, such as in applications in density functional theory [7].
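
A corresponding sketch of the relaxed update (2), again with a function name and termination test of our own choosing, is given below; setting omega=1 recovers the standard method.

```python
import numpy as np

def sinkhorn_relaxed(K, a, b, omega, num_iters=1000, tol=1e-12):
    """Sinkhorn iteration with relaxation parameter omega, cf. (2)."""
    u = np.ones(K.shape[0])
    v = np.ones(K.shape[1])
    for _ in range(num_iters):
        u = u**(1.0 - omega) * (a / (K @ v))**omega      # relaxed u-update
        v = v**(1.0 - omega) * (b / (K.T @ u))**omega    # relaxed v-update
        if np.linalg.norm(u * (K @ v) - a, 1) < tol:
            break
    return u, v
```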

While global convergence for \(\omega \ne 1\) is not obvious anymore, local convergence of the modified method is ensured for all \(0< \omega < 2\), and the asymptotically optimal relaxation parameter can be determined from its linearization at a fixed point \((u^*,v^*)\). In logarithmic coordinates, the linearization of the standard Sinkhorn method has the iteration matrix

$$\begin{aligned} M = {{\,\mathrm{diag}\,}}\left( \frac{1}{a} \right) P_* {{\,\mathrm{diag}\,}}\left( \frac{1}{b} \right) P_*^\mathsf T, \qquad \text {where} \quad P_* = {{\,\mathrm{diag}\,}}(u^*) K {{\,\mathrm{diag}\,}}(v^*). \end{aligned}$$
(3)

The local convergence rate equals the second largest eigenvalue

$$\begin{aligned} 0 \le \vartheta ^2 < 1 \end{aligned}$$

of that matrix; see [8]. Note that M has real and nonnegative eigenvalues since it is similar to a positive semidefinite matrix, and its largest eigenvalue equals one (the eigenvector having constant entries), which accounts for the scaling indeterminacy in the problem formulation. For the modified method with relaxation, the local rate is also related to \(\vartheta ^2\), as has been worked out in [5] and is summarized in the following theorem. For convenience, we provide a brief outline of how this result can be obtained at the end of Sect. 2.
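
Given a (numerically converged) solution pair, the matrix M from (3) and the quantity \(\vartheta ^2\) can be formed explicitly. A small dense-linear-algebra sketch, assuming u_star and v_star come, e.g., from an iteration as sketched above:

```python
import numpy as np

def second_largest_eigenvalue(K, a, b, u_star, v_star):
    """Form M from (3) and return its second largest eigenvalue theta^2."""
    P = np.diag(u_star) @ K @ np.diag(v_star)           # P_* = diag(u*) K diag(v*)
    M = np.diag(1.0 / a) @ P @ np.diag(1.0 / b) @ P.T
    eigvals = np.sort(np.linalg.eigvals(M).real)[::-1]  # spectrum is real and nonnegative
    return eigvals[1]                                   # eigvals[0] is (numerically) 1
```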

Theorem 1

(cf. [5]) Assume \(\vartheta ^2 > 0\). For all choices of \(0< \omega < 2\) the modified Sinkhorn algorithm (2) is locally convergent in some neighborhood of \((u^*,v^*)\). Its asymptotic (R-linear) convergence rate is

$$\begin{aligned} \rho _\vartheta (\omega ) := {\left\{ \begin{array}{ll} \frac{1}{4}\left( \omega \vartheta + \sqrt{\omega ^2 \vartheta ^2 - 4(\omega -1 )} \right) ^2, \quad &{}\text {if }\quad 0< \omega \le \omega ^{\mathrm{opt}}, \\ \omega - 1, \quad &{}\text {if}\quad \omega ^{\mathrm{opt}} \le \omega < 2, \end{array}\right. } \end{aligned}$$
(4)

where

$$\begin{aligned} \omega ^{\mathrm{opt}} = \frac{2}{1+\sqrt{1 - \vartheta ^2}} > 1. \end{aligned}$$
(5)

It holds \(\rho _\vartheta (\omega ) < 1\) for all \(0< \omega < 2\), and \(\omega ^{\mathrm{opt}}\) provides the minimal possible rate (independent of the starting point) on that interval, namely

$$\begin{aligned} \rho ^{{\mathrm{opt}}} = \omega ^{\mathrm{opt}} - 1 = \frac{1 - \sqrt{1 - \vartheta ^2}}{1 + \sqrt{1 - \vartheta ^2}} < \vartheta ^2. \end{aligned}$$
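
The rate function (4) and the optimal parameter (5) are straightforward to evaluate; a sketch (function names are ours):

```python
import numpy as np

def omega_opt(theta2):
    """Optimal relaxation parameter (5) for given theta^2 in (0, 1)."""
    return 2.0 / (1.0 + np.sqrt(1.0 - theta2))

def rho(omega, theta2):
    """Asymptotic convergence rate (4) of the relaxed method for 0 < omega < 2."""
    if omega <= omega_opt(theta2):
        root = np.sqrt(omega**2 * theta2 - 4.0 * (omega - 1.0))
        return 0.25 * (omega * np.sqrt(theta2) + root)**2
    return omega - 1.0
```

For example, \(\vartheta ^2 = 0.99\) gives \(\omega ^{\mathrm{opt}} = 2/1.1 \approx 1.818\) and \(\rho ^{\mathrm{opt}} \approx 0.818\), compared to the standard rate \(\rho _\vartheta (1) = \vartheta ^2 = 0.99\).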

By the above theorem, the optimal relaxation parameter \(\omega ^{\mathrm{opt}}\) is always larger than one (if \(\vartheta ^2 > 0\)). In fact, by the exact formula (4) for the convergence rate, the range of \(\omega \) for which the modified method is asymptotically strictly faster than the standard Sinkhorn method, that is, \(\rho _{\vartheta }(\omega ) < \vartheta ^2 = \rho _{\vartheta }(1)\), is precisely the interval

$$\begin{aligned} 1< \omega < 1 + \vartheta ^2. \end{aligned}$$
(6)

However, the value of \(\vartheta ^2\) depends on the solution and is therefore not known in advance. To deal with this problem, an adaptive procedure for choosing \(\omega \) is proposed in [5].

The main contribution of this note is an a priori interval for the relaxation parameter \(\omega \) on which the modified iteration is both globally convergent and locally faster than the standard Sinkhorn method. In Theorem 3 we first prove global convergence of the modified method for parameters in the interval \(0< \omega < \frac{2}{ 1+\Lambda (K)}\). In Theorem 4 we then provide an a priori lower bound \(\vartheta ^2 \ge \delta _{K,a,b} > 0\), which depends only on the data of the problem but requires a full rank assumption on K. By (6), any \(\omega \in (1, 1 + \delta _{K,a,b})\) then satisfies \(\rho _\vartheta (\omega ) < \vartheta ^2\). Taken together, this yields the following result.

Theorem 2

Assume \({{\,\mathrm{rank}\,}}(K) = \min (m,n) \ge 2\). For any \(1< \omega < 1 + \vartheta ^2\) the asymptotic local convergence rate of the modified Sinkhorn method (2) is faster than for the standard Sinkhorn method. For \(1< \omega < \min \left( 1 + \delta _{K,a,b}, \frac{2}{1 + \Lambda (K)} \right) \) the modified method is both globally convergent and asymptotically faster than the standard method.

We remark that the a priori interval for \(\omega \) derived here is usually very small, and hence our result is mainly of theoretical interest. In the relevant cases, when \(\vartheta ^2\) is close to one, significant acceleration is achieved only when \(\omega \) is close to \(\omega ^{{\mathrm{opt}}}\) (which tends to two for \(\vartheta ^2 \rightarrow 1\)). A possible heuristic for selecting a nearly optimal relaxation is to approximate the second largest eigenvalue of M based on the current iterate. After a similarity transform, this requires computing the spectral norm of a symmetric matrix. An even simpler approach, as suggested in [5], is to directly estimate \(\vartheta ^2\), and hence \(\omega ^{{\mathrm{opt}}}\), by monitoring the convergence rate of the standard Sinkhorn method in terms of a suitable residual. In the final Sect. 4 we include numerical illustrations, which indicate that in certain cases such heuristics can be quite precise already in the initial phase of the algorithm, resulting in an almost optimal convergence rate at almost no additional cost. This confirms that overrelaxation is a simple way to significantly accelerate the Sinkhorn method in cases where it is slow. For completeness, we should mention that alternative approaches for solving problem (1) that aim at fast convergence have been proposed based on Newton’s method; see, e.g., [9, 10] and references therein.

The convergence analysis of the Sinkhorn method is usually carried out in a log-domain formulation [4]. We choose the closely related framework of compositional data space used, e.g., in statistics [11], which we think could be of independent interest in this context. In this space, which is introduced in the next section, the Sinkhorn algorithm with a positive matrix K reads as a nonlinear fixed point iteration for an essentially contractive iteration function, as is known from the Birkhoff–Hopf theorem. The main results are then presented in Sect. 3. Let us note that the assumption that K has strictly positive entries is not essential for all of the results. While global convergence of the standard Sinkhorn method to a unique (up to scaling) positive solution \((u,v)\) of (1) can be shown under several weaker assumptions, most notably when \(a = b = \mathbf {1}\) and K is square, nonnegative and has total support [12], we require the global contractivity of the process in Hilbert metric (which holds for positive K) in our proof that global convergence can still be ensured for some \(\omega > 1\) (Theorem 3). The idea of accelerating convergence by overrelaxation, on the other hand, is very general and the local spectral analysis provided by Theorem 1 applies whenever the iteration (2) is locally well defined around a (positive) fixed point \((u^*,v^*)\) and \(\vartheta ^2 < 1\). Correspondingly, Theorem 4 on a lower bound for \(\vartheta ^2\) does not require K to be positive. Hence one has guaranteed acceleration of local convergence for \(1< \omega < 1 + \delta _{K,a,b}\) in several scenarios where K is only nonnegative.

2 Formulation in compositional data space

The problem (1) as well as the Sinkhorn algorithm and its modified variant inherit a natural scaling indeterminacy of the variables u and v. It can therefore be formulated on a suitable space of equivalence classes. Here we recast the algorithm in the framework of what is called compositional data space; see, e.g., [11, 13]. To this aim, let

$$\begin{aligned} \mathcal {C}^{m} := \mathbb {R}^{m}_+/\sim , \end{aligned}$$

where

$$\begin{aligned} x \sim x' \quad :\Longleftrightarrow \quad \exists t > 0: \ x = t x'. \end{aligned}$$

The resulting equivalence class of x will be denoted by \(\underline{x}\). One specifies a vector addition and a scalar multiplication on \(\mathcal {C}^{m}\) via

$$\begin{aligned} \underline{x} + \underline{y} := \underline{x {}\circ {}y}, \qquad \gamma \cdot \underline{x} := \underline{x^\gamma }, \quad \gamma \in \mathbb {R}, \end{aligned}$$

where \(x^\gamma \) has the components \((x_1^\gamma ,\dots ,x_m^\gamma )\). As a result \((\mathcal {C}^{m}, +, \cdot )\) becomes a real vector space of dimension \(m-1\). In this space we consider the so-called Hilbert norm

$$\begin{aligned} \Vert \underline{x} \Vert _H := \log \max _{i,j} \frac{x_i}{x_j}, \end{aligned}$$

turning \((\mathcal {C}^{m},{}+{}, {}\cdot {}, \Vert \cdot \Vert _H)\) into a finite dimensional Banach space. Note that this norm on the equivalence classes coincides with the well-known Hilbert distance on the representatives:

$$\begin{aligned} d_H(x,y) = \Vert \underline{x} - \underline{y} \Vert _H. \end{aligned}$$

Similarly we construct a Banach space \(\mathcal C^n = \mathbb {R}^n_+ /\sim \).
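
In coordinates, the Hilbert norm of a class (equivalently, the Hilbert distance between two positive representatives) is cheap to evaluate; a sketch with our own helper names:

```python
import numpy as np

def hilbert_distance(x, y):
    """Hilbert (projective) distance d_H between positive vectors x and y."""
    r = np.log(x) - np.log(y)       # d_H(x, y) = max_i r_i - min_j r_j
    return r.max() - r.min()

def hilbert_norm(x):
    """Hilbert norm of the class of x, i.e. d_H(x, ones)."""
    return np.log(x.max() / x.min())
```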

The modified Sinkhorn algorithm (2) can be interpreted as an iteration in the space \(\mathcal {C}^m \times \mathcal {C}^n\) and reads

$$\begin{aligned} \begin{aligned} \underline{u}_{\ell + 1}&= (1-\omega ) \cdot \underline{u}_\ell + \omega \cdot \underline{a} - \omega \cdot \mathcal {K}(\underline{v}_\ell ) , \\ \underline{v}_{\ell + 1}&= (1-\omega ) \cdot \underline{v}_\ell + \omega \cdot \underline{b} - \omega \cdot \mathcal {K}_\mathsf T(\underline{u}_{\ell + 1}), \end{aligned} \end{aligned}$$
(7)

where \(\mathcal {K}:\mathcal {C}^n \rightarrow \mathcal {C}^m\) and \(\mathcal {K}_\mathsf T:\mathcal {C}^m \rightarrow \mathcal {C}^n\) are now the nonlinear maps given by

$$\begin{aligned} \mathcal {K}(\underline{v}) = \underline{Kv}, \qquad \mathcal {K}_\mathsf T(\underline{u}) = \underline{K^\mathsf Tu}. \end{aligned}$$

The convergence of the standard Sinkhorn algorithm (\(\omega = 1\)) is based on a famous result of Birkhoff and Hopf on the contractivity of \(\mathcal {K}\) and \(\mathcal {K}_\mathsf T\). To state it, define the quantities

$$\begin{aligned} \eta (K) := \max \limits _{i,j,k,\ell } \frac{K_{ik}K_{j\ell }}{K_{jk}K_{i\ell }} \quad \text{ and } \quad \Lambda (K) := \frac{\sqrt{\eta (K)} - 1}{\sqrt{\eta (K)} + 1}. \end{aligned}$$
(8)

Then the following holds; for a proof, see, e.g., [14, Theorems 3.5 & 6.2].

Theorem (Birkhoff–Hopf) For any \(K \in \mathbb {R}^{m \times n}_{+}\) let \(\Lambda (K)\) be defined as above. Then

$$\begin{aligned} \sup _{v, v' \in \mathbb {R}^{n}_+, \, v \not \sim v'} \frac{d_H(Kv,Kv')}{d_H(v,v')} = \Lambda (K). \end{aligned}$$

Note that \(\Lambda (K) = \Lambda (K^\mathsf T) < 1\). As a result, both \(\mathcal {K}\) and \(\mathcal {K}_\mathsf T\) are contractive maps in the Hilbert norm with Lipschitz constant \(\Lambda (K)\), which is also called the Birkhoff contraction ratio of K. Based on this, it is not difficult to establish the global convergence of the standard Sinkhorn algorithm in the space \(\mathcal C^m \times \mathcal C^n\) at a rate \(O(\Lambda (K)^2)\).
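
The quantities (8) are computable directly from the entries of K. The following sketch vectorizes the maximization over the four indices after a logarithmic transform (memory is O(m²n), so this is only meant for moderate sizes; the function name is ours):

```python
import numpy as np

def birkhoff_ratio(K):
    """Birkhoff contraction ratio Lambda(K) via the cross ratio eta(K) from (8)."""
    logK = np.log(K)
    # D[i, j, k] = log K[i, k] - log K[j, k]
    D = logK[:, None, :] - logK[None, :, :]
    A = D.max(axis=2)                 # A[i, j] = max_k (log K[i, k] - log K[j, k])
    log_eta = (A + A.T).max()         # log eta(K), maximized over i, j, k, l
    sqrt_eta = np.exp(0.5 * log_eta)
    return (sqrt_eta - 1.0) / (sqrt_eta + 1.0)
```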

It is important to emphasize that studying the convergence in \(\mathcal C^m \times \mathcal C^n\), that is, convergence of equivalence classes, is sufficient for understanding the method in \(\mathbb {R}^m_+ \times \mathbb {R}^n_+\). Indeed, a pair \((\underline{u}^*, \underline{v}^*)\) is a fixed point of (7) if and only if for any choice of representatives \((u^*, v^*)\) there exist \(\lambda , \mu > 0\) such that \(u^* {}\circ {}K v^* = \lambda a\) and \(v^* {}\circ {}K^\mathsf Tu^* = \mu b\). From \(\mathbf {1}_m^\mathsf Ta = \mathbf {1}_n^\mathsf Tb = 1\) (here \(\mathbf {1}\) denotes a vector of all ones) it follows that \(\lambda = \mu = (u^*)^\mathsf TK v^*\), and hence, e.g., \(u^+ := \lambda ^{-1/2} u^* \) and \(v^+ := \lambda ^{-1/2} v^*\) solve the initial problem (1). Moreover, choosing representatives \((u_\ell , v_\ell )\) of the iterates \((\underline{u}_\ell , \underline{v}_\ell )\) such that \(\mathbf {1}_m^\mathsf Tu_\ell = \mathbf {1}_n^\mathsf Tv_\ell = 1\), and setting \(u_\ell ^+ := \lambda _\ell ^{-\frac{1}{2} } u_\ell \), \(v_\ell ^+ := \lambda _\ell ^{-\frac{1}{2} } v_\ell \) with \( \lambda _\ell = u_\ell ^\mathsf TK v_\ell ^{}\), yields a sequence that converges exponentially fast to a solution of (1).
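
The rescaling described above is a one-liner once representatives of a (numerically converged) fixed point are at hand; a sketch with our own helper name:

```python
import numpy as np

def rescale_to_solution(K, u, v):
    """Rescale representatives (u, v) of the fixed point so that (1) holds."""
    lam = u @ (K @ v)                       # lambda = u^T K v
    return u / np.sqrt(lam), v / np.sqrt(lam)
```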

We now briefly outline how the local convergence analysis for (7) can be conducted [5], leading to Theorem 1. By combining both steps of the iteration (7) into a nonlinear fixed point iteration \( (\underline{u}_{\ell + 1}, \underline{v}_{\ell + 1}) = \mathcal F(\underline{u}_\ell , \underline{v}_\ell ) \) in the space \(\mathcal C^m \times \mathcal C^n\), one finds that its derivative at the fixed point \((\underline{u}^*, \underline{v}^*)\) takes the form

$$\begin{aligned} M_\omega := (I_{m+n} - \omega \cdot L)^{-1}[(1 - \omega ) \cdot I_{m+n} + \omega \cdot U], \end{aligned}$$
(9)

where

$$\begin{aligned} L = \begin{pmatrix} 0 &{}\quad 0 \\ -\mathcal {K}_\mathsf T'(\underline{u}^*) &{}\quad 0 \end{pmatrix}, \qquad U = \begin{pmatrix} 0 &{}\quad -\mathcal {K}'(\underline{v}^*) \\ 0 &{} \quad 0 \end{pmatrix}. \end{aligned}$$

Matrices of the form \(M_\omega \) are well known as error iteration matrices of block SOR methods for linear systems. The spectral radius of \(M_\omega \) can be computed exactly from formula (4), if the spectral radius \(\vartheta \) of \(L + U\) is known; see [15, Sec. 6.2] or [16, Thm. 4.27]. The eigenvalues of \(L+U\), however, are square roots of the eigenvalues of the composition of derivatives \(\mathcal {K}'(\underline{v}^*) \mathcal {K}_\mathsf T'(\underline{u}^*)\), which is a linear map on \(\mathcal C^m\). It remains to show that the largest eigenvalue of that operator is precisely the second largest eigenvalue of the matrix M in (3). Indeed, by elementary calculations, M is the matrix representation of \(\mathcal {K}'(\underline{v}^*) \mathcal {K}_\mathsf T'(\underline{u}^*)\) under the isomorphism \(u \mapsto \underline{\exp (u)}\) between the subspace \(\{ u \in \mathbb {R}^m :\mathbf {1}_m^\mathsf Tu = 0 \} \subseteq \mathbb {R}^m\) and \(\mathcal C^m\).
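
The relation between (9), (4) and the matrix M in (3) can be checked numerically. The sketch below assembles \(M_\omega \) in plain (unquotiented) log coordinates of \(\mathbb {R}^{m+n}\), so the scaling mode contributes the trivial eigenvalues 1 and \((\omega -1)^2\); in view of (3) and the discussion above, the Jacobian blocks are represented by diag(1/a) P_* and diag(1/b) P_*^T. The function name is ours.

```python
import numpy as np

def sor_second_eigenvalue(K, a, b, u_star, v_star, omega):
    """Assemble M_omega from (9) in log coordinates and return its second
    largest eigenvalue modulus (the largest one is the trivial eigenvalue 1)."""
    m, n = K.shape
    P = np.diag(u_star) @ K @ np.diag(v_star)
    A = np.diag(1.0 / a) @ P          # represents K'(v*) in log coordinates
    B = np.diag(1.0 / b) @ P.T        # represents K_T'(u*) in log coordinates
    L = np.zeros((m + n, m + n)); L[m:, :m] = -B
    U = np.zeros((m + n, m + n)); U[:m, m:] = -A
    I = np.eye(m + n)
    M_omega = np.linalg.solve(I - omega * L, (1.0 - omega) * I + omega * U)
    mods = np.sort(np.abs(np.linalg.eigvals(M_omega)))[::-1]
    return mods[1]
```

For \(1 \le \omega \le \omega ^{\mathrm{opt}}\) the returned value should agree with \(\rho _\vartheta (\omega )\) up to the accuracy of the computed fixed point; for \(\omega > \omega ^{\mathrm{opt}}\) the relevant eigenvalues become complex with modulus \(\omega - 1\), consistent with the second branch of (4).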

3 Main results

We prove the global convergence of the modified method for a range of values \(\omega \) larger than one.

Theorem 3

Let \(\Lambda = \Lambda (K)\) be the Birkhoff contraction ratio of K. For \(0< \omega < \frac{2}{ 1+\Lambda }\), the modified Sinkhorn algorithm (7) converges, for any starting point, to \((\underline{u}^*, \underline{v}^*)\) exponentially fast.

Proof

Starting from (7), using the triangle inequality and the contractivity of \(\mathcal {K}\) and \(\mathcal {K}_\mathsf T\) provided by the Birkhoff-Hopf theorem, we obtain

$$\begin{aligned} \begin{aligned} \Vert \underline{u}_{\ell + 1}-\underline{u}^*\Vert _H&\le |1-\omega | \, \Vert \underline{u}_\ell -\underline{u}^*\Vert _H + \omega \Lambda \, \Vert \underline{v}_\ell -\underline{v}^*\Vert _H , \\ \Vert \underline{v}_{\ell + 1}-\underline{v}^*\Vert _H&\le |1-\omega | \, \Vert \underline{v}_\ell -\underline{v}^*\Vert _H + \omega \Lambda \, \Vert \underline{u}_{\ell + 1}-\underline{u}^*\Vert _H \\&\le \left( |1-\omega |+(\omega \Lambda )^2\right) \, \Vert \underline{v}_\ell -\underline{v}^*\Vert _H + \omega \Lambda |1-\omega | \, \Vert \underline{u}_{\ell }-\underline{u}^*\Vert _H. \end{aligned} \end{aligned}$$

As a consequence, for \(\Delta u_\ell := \Vert \underline{u}_{\ell }-\underline{u}^*\Vert _H\) and \(\Delta v_\ell := \Vert \underline{v}_{\ell }-\underline{v}^*\Vert _H\) we obtain

$$\begin{aligned} \begin{pmatrix} \Delta u_{\ell +1} \\ \Delta v_{\ell +1} \end{pmatrix} \le T_\omega \begin{pmatrix} \Delta u_{\ell } \\ \Delta v_{\ell } \end{pmatrix}, \qquad \text {where} \quad T_\omega = \begin{pmatrix} |1-\omega | &{}\quad \omega \Lambda \\ \omega \Lambda |1-\omega | &{}\quad |1-\omega | + (\omega \Lambda )^2 \end{pmatrix}, \end{aligned}$$

and the vector inequality is understood entry-wise. Since all involved quantities are nonnegative, the inequality can be iterated, which gives

$$\begin{aligned} \begin{pmatrix} \Delta u_{\ell +1} \\ \Delta v_{\ell +1} \end{pmatrix} \le (T_\omega )^{\ell + 1} \begin{pmatrix} \Delta u_{0} \\ \Delta v_{0} \end{pmatrix}. \end{aligned}$$

Hence, to prove exponential convergence it suffices to show that the spectral radius of \(T_\omega \) is strictly less than one. Since the spectral radius equals \( |1-\omega | + \frac{(\omega \Lambda )^2 }{2} + \sqrt{\frac{(\omega \Lambda )^4}{4} + (\omega \Lambda )^2 |1-\omega |} \) , this is the case if and only if \(0< \omega < 2/(1 + \Lambda )\). \(\square \)
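
The final step of the proof is easy to check numerically; a small sketch that evaluates the stated expression for the spectral radius of \(T_\omega \):

```python
import numpy as np

def spectral_radius_T(omega, Lam):
    """Spectral radius of T_omega from the proof of Theorem 3."""
    s = abs(1.0 - omega)              # |1 - omega|
    t = (omega * Lam)**2              # (omega * Lambda)^2
    return s + 0.5 * t + np.sqrt(0.25 * t**2 + t * s)

# For example, with Lam = 0.9 the radius crosses 1 exactly at omega = 2 / 1.9.
```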

Next we provide a lower bound for the second largest eigenvalue \(\vartheta ^2\) of the matrix M in (3), which by (6) then yields an interval for \(\omega \) such that the modified method has a strictly faster asymptotic convergence rate than the standard Sinkhorn method.

Theorem 4

Let \({{\,\mathrm{rank}\,}}(K) = \min (m,n) \ge 2\) and

$$\begin{aligned} \delta _1 = \frac{a_{\min }}{b_{\max }} \cdot \frac{1 - b_{\max }}{\left( \frac{\Vert K \Vert _\infty }{\sigma _{\min }(K)}\right) ^2 - a_{\min }}> 0, \qquad \delta _2 = \frac{b_{\min }}{a_{\max }} \cdot \frac{1 - a_{\max }}{\left( \frac{\Vert K^\mathsf T\Vert _\infty }{\sigma _{\min }(K)} \right) ^2 - b_{\min }} >0, \end{aligned}$$

where \(\sigma _{\min }(K)\) is the smallest positive singular value of K, \(\Vert K \Vert _\infty = \max _{\Vert v \Vert _\infty = 1} \Vert K v \Vert _\infty \), and the subscripts \(\min \), \(\max \) denote the smallest and largest entry of the corresponding vector. Then it holds

$$\begin{aligned} \vartheta ^2 \ge \delta _{K,a,b} := {\left\{ \begin{array}{ll} \delta _1 &{}\quad \text {if } m > n, \\ \delta _2 &{}\quad \text {if } m < n, \\ \max (\delta _1,\delta _2) &{}\quad \text {if } m = n. \end{array}\right. } \end{aligned}$$

Note that for a positive matrix \(\Vert K\Vert _\infty > \sigma _{\min }(K)\). Moreover, \(a_{\min } \le \frac{1}{m} \le \frac{1}{n} \le b_{\max } < 1\) if \(m \ge n\), and vice versa if \(m \le n\). Hence \(\delta _{K,a,b}\) is indeed smaller than one, which is in line with the bound \(\vartheta ^2 \le \Lambda (K)^2\).

Proof

We consider the case \(m \ge n\). Instead of the matrix M we consider the positive semidefinite matrix

$$\begin{aligned} H = {{\,\mathrm{diag}\,}}\left( \frac{u^*}{a^{1/2}} \right) K {{\,\mathrm{diag}\,}}\left( \frac{v^* {}\circ {}v^*}{b}\right) K^\mathsf T{{\,\mathrm{diag}\,}}\left( \frac{u^*}{a^{1/2}} \right) \in \mathbb {R}^{m \times m}, \end{aligned}$$

which is obtained from M by a similarity transformation (and using (1)). Since the dominant eigenvector of H (with eigenvalue one) is \(a^{1/2}\), we have

$$\begin{aligned} \vartheta ^2 = \max \left\{ \frac{ \langle w, H w \rangle }{\langle w, w \rangle } :\langle w, a^{1/2} \rangle = 0 \right\} . \end{aligned}$$

By projecting on the orthogonal complement of \(a^{1/2}\), and noting that \(\Vert a^{1/2} \Vert ^2 = 1\), we first rewrite this as

$$\begin{aligned} \vartheta ^2 = \max \frac{ \langle w, H w \rangle - \langle w, a^{1/2} \rangle ^2}{\langle w, w \rangle - \langle w, a^{1/2} \rangle ^2}, \end{aligned}$$

where the maximum is taken over all w that are not collinear to \(a^{1/2}\). For such w the numerator is always nonnegative and the denominator is positive. Next we substitute

$$\begin{aligned} w = a^{-1/2} {}\circ {}Kv^* {}\circ {}z \end{aligned}$$

with a new variable z. This yields

$$\begin{aligned} \vartheta ^2 = \max \frac{ \langle K^\mathsf Tz, \frac{v^* {}\circ {}v^*}{b} {}\circ {}K^\mathsf Tz \rangle - \langle K^\mathsf Tz, v^* \rangle ^2}{\left\langle z, \frac{Kv^* {}\circ {}Kv^*}{a} {}\circ {}z \right\rangle - \langle K^\mathsf Tz, v^* \rangle ^2}, \end{aligned}$$

where the maximum is taken over all z not collinear with \(u^*\) (the numerator is then nonnegative and the denominator is positive). To obtain a lower bound, we now evaluate the expression at z satisfying

$$\begin{aligned} K^\mathsf Tz = e_j \end{aligned}$$

where \(e_j\) denotes the j-th unit vector. Such a z exists since \(K^\mathsf T\) has full row rank; we take the minimum norm solution, so that \(\Vert z \Vert \le 1/\sigma _{\min }(K)\). Moreover, z is indeed not collinear to \(u^*\), since otherwise \(K^\mathsf Tu^*\) would be collinear with \(e_j\), which contradicts the positivity of \(K^\mathsf Tu^* {}\circ {}v^* = b\). Therefore, using this z, we get

$$\begin{aligned} \vartheta ^2 \ge \frac{\left( \frac{1}{b_{\max }} - 1 \right) (v^*)_j^2 }{ \left\langle z, \frac{Kv^* {}\circ {}Kv^*}{a} {}\circ {}z \right\rangle - (v^*)_j^2}. \end{aligned}$$

We can choose j as the position of a largest entry of the vector \(v^*\). Then in the denominator

$$\begin{aligned} \left\langle z, \frac{Kv^* {}\circ {}Kv^*}{a} {}\circ {}z \right\rangle \le \max _i \frac{(Kv^*)_i^2}{a_i} \Vert z \Vert ^2 \le \frac{ \Vert K \Vert _\infty ^2 (v^*)_j^2}{a_{\min }} \frac{1}{\sigma _{\min }(K)^2}. \end{aligned}$$

This leads to the asserted lower bound \(\vartheta ^2 \ge \delta _1\).

When \(m \le n\), we can simply interchange the roles of K and \(K^\mathsf T\), a and b, as well as \(u^*\) and \(v^*\) in this proof to obtain \(\vartheta ^2 \ge \delta _2\). \(\square \)
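
The bound \(\delta _{K,a,b}\) of Theorem 4 only involves quantities that are directly computable from the data; a sketch (recall that \(\Vert K \Vert _\infty \) is the maximal row sum and \(\Vert K^\mathsf T\Vert _\infty \) the maximal column sum of the nonnegative matrix K, and that full rank is assumed; the function name is ours):

```python
import numpy as np

def delta_bound(K, a, b):
    """A priori lower bound delta_{K,a,b} on theta^2 from Theorem 4."""
    m, n = K.shape
    sigma_min = np.linalg.svd(K, compute_uv=False).min()   # full rank assumed
    row_norm = K.sum(axis=1).max()                         # ||K||_inf
    col_norm = K.sum(axis=0).max()                         # ||K^T||_inf
    d1 = (a.min() / b.max()) * (1.0 - b.max()) / ((row_norm / sigma_min)**2 - a.min())
    d2 = (b.min() / a.max()) * (1.0 - a.max()) / ((col_norm / sigma_min)**2 - b.min())
    if m > n:
        return d1
    if m < n:
        return d2
    return max(d1, d2)
```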

Taken together, Theorems 3 and 4 yield Theorem 2.

4 Numerical illustration

We illustrate the effect of overrelaxation by two numerical experiments related to optimal transport. The first is motivated by an application to color transfer between images [17]. The matrix \(K = K_\varepsilon \) is generated as

$$\begin{aligned} K_{ij} = \exp \left( - \frac{ \Vert x_i - y_j \Vert ^2}{\varepsilon }\right) \end{aligned}$$

where \(x_i, y_j \in \mathbb {R}^3\) are RGB values (scaled to [0, 1]) of \(m = n = 1000\) randomly sampled pixels in two different color images, respectively. The vectors a and b are chosen as uniform distributions, i.e. \(a = \mathbf {1}_m/m\) and \(b = \mathbf {1}_n/n\). We choose \(\varepsilon = 0.01\). In this scenario the standard Sinkhorn method is reasonably fast, but can still be accelerated using overrelaxation. A typical outcome for different relaxation strategies is shown in Fig. 1 left, where we plot for 500 iterations the \(\ell _1\)-distance \(\Vert P_\ell - P_* \Vert _1\) between the matrices \(P_\ell = {{\,\mathrm{diag}\,}}(u_\ell ) K {{\,\mathrm{diag}\,}}(v_\ell )\) and a numerical reference solution \(P_* = {{\,\mathrm{diag}\,}}(u^*) K {{\,\mathrm{diag}\,}}(v^*)\). This error corresponds to the total variation distance of the corresponding transport plans. Even if this quantity (specifically \(P_*\)) is not available in a practical computation, it is a natural measure for the convergence of the method. Besides the standard Sinkhorn method (\(\omega = 1\)), we run the method with a fixed relaxation \(\omega = 1.5\), and with the ‘optimal’ relaxation \(\omega ^{\mathrm{opt}}\), which is computed via formula (5) from the second largest singular value \(\vartheta \) of the matrix \({{\,\mathrm{diag}\,}}(1/a^{1/2}) P_* {{\,\mathrm{diag}\,}}(1/b^{1/2})\) [then \(\vartheta ^2\) is the second largest eigenvalue of (3)]; a sketch of this computation follows below. We do not consider relaxation based on the lower bound on \(\vartheta ^2\) in Theorem 4, since the resulting \(\omega \) is too close to one. In all variants of the algorithm the same (uniformly) random starting vectors \(u_0\) and \(v_0\) are used.
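
The computation of the optimal relaxation from the reference solution amounts to a few lines; a sketch (the function name is ours, and u_star, v_star denote the converged reference scalings):

```python
import numpy as np

def omega_from_reference(K, a, b, u_star, v_star):
    """omega_opt via the second largest singular value of diag(a^-1/2) P_* diag(b^-1/2)."""
    P = np.diag(u_star) @ K @ np.diag(v_star)
    S = np.diag(a**-0.5) @ P @ np.diag(b**-0.5)
    theta = np.linalg.svd(S, compute_uv=False)[1]      # second largest singular value
    return 2.0 / (1.0 + np.sqrt(1.0 - theta**2))       # formula (5)
```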

Fig. 1 Effect of different relaxation strategies in two examples

As can be seen, using \(\omega ^{\mathrm{opt}}\) significantly accelerates the convergence speed. Moreover, although \(\omega ^{\mathrm{opt}}\) only provides the optimal local rate, the positive effect sets in almost immediately. However, the value of \(\omega ^{\mathrm{opt}}\) is a priori unknown in practice. Therefore we also tested a simple heuristic, similar to the one suggested in [5]. It is known that the convergence of the Sinkhorn method can be monitored, e.g., through the error \(\Vert P_{\ell } \mathbf {1}_n - a \Vert _1\); cf. [4, Remark 4.14]. Since \(\vartheta ^2\) equals the asymptotic convergence rate of the standard Sinkhorn method, we may therefore take

$$\begin{aligned} {\hat{\vartheta }}^2 = \sqrt{\frac{\Vert P_{\ell } \mathbf {1}_n - a \Vert _1}{\Vert P_{\ell -2} \mathbf {1}_n - a \Vert _1}} \end{aligned}$$

as a current approximation for \(\vartheta ^2\). In the purple curve (diamond markers) in Fig. 1 left, we updated \(\omega \) a single time after 20 steps of the standard method based on this quantity, and using formula (5). This comes at almost no additional cost, but yields the near optimal rate in this example. Of course such a heuristic could be applied in a more systematic way, e.g., by monitoring the changes of \(\left( \frac{\Vert P_{\ell } \mathbf {1}_n - a \Vert _1}{\Vert P_{\ell -p} \mathbf {1}_n - a \Vert _1} \right) ^{1/p}\) for a suitable value of p over several iterations. We note that adapting \(\omega \) in (linear and nonlinear) SOR methods based on currently observed convergence rates is a classical idea and has been proposed, e.g., in [15] or [19].

As a second example we consider a 1D transport problem between two random measures a and b (generated from a uniform distribution) on an equidistant grid in [0, 1], and with \(\ell _1\)-norm as a cost. The matrix K in this case is given as

$$\begin{aligned} K_{ij} = \exp \left( - \frac{|\frac{i}{m-1} - \frac{j}{n-1} |}{\varepsilon }\right) . \end{aligned}$$

Again we choose \(m=n = 1000\) and \(\varepsilon = 0.01\), and then compare different relaxation strategies, starting from the same random initialization \((u_0,v_0)\). As can be seen in Fig. 1 right, which shows 500 iterations with different relaxation strategies, this problem seems to be more difficult and the standard Sinkhorn method is extremely slow. A suitable relaxation compensates for this and restores fast convergence; however, as illustrated by the slow convergence of the curve for \(\omega = 1.5\), the estimation of \(\omega ^{\mathrm{opt}}\), and hence of \(\vartheta ^2\), needs to be rather precise. Since here the convergence rate of the standard method stabilizes later, we apply the above heuristic of estimating \(\vartheta ^2\) only after 200 iterations of the standard iteration, resulting in the purple curve (diamond markers). The oscillatory behavior occurs because \(\omega \) is estimated larger than \(\omega ^{\mathrm{opt}}\), in which case the spectral radius \(\omega - 1\) of the linearized iteration matrix \(M_\omega \) in (9) is attained at complex eigenvalues. It is possible in this example to update \(\omega \) earlier using computationally more expensive heuristics. For instance, the green curve (triangle markers) is obtained by computing, after 50 iterations of the standard method, an approximation of \(\vartheta \) as the second largest singular value of the matrix \({{\,\mathrm{diag}\,}}(1/a^{1/2}_\ell ) P_\ell {{\,\mathrm{diag}\,}}(1/b^{1/2}_\ell )\), where \(a_\ell = u_\ell {}\circ {}Kv_\ell \) and \(b_\ell = v_\ell {}\circ {}K^\mathsf Tu_\ell \). This can be done iteratively; we used the Matlab function svds. This results in an almost optimal convergence rate in this example. Of course, several similar strategies could be devised.
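
In Python, the more expensive heuristic behind the green curve could be sketched as follows, with scipy.sparse.linalg.svds playing the role of Matlab's svds (a matrix-free sketch under our own naming; it estimates \(\vartheta \) from the current iterate and returns the corresponding \(\omega \) via (5)):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, svds

def estimate_omega(K, u, v):
    """Estimate omega_opt from the current iterate (u, v) via two singular values."""
    a_l = u * (K @ v)                 # current row marginal a_l
    b_l = v * (K.T @ u)               # current column marginal b_l
    m, n = K.shape
    # S = diag(a_l^{-1/2}) diag(u) K diag(v) diag(b_l^{-1/2}), applied matrix-free
    S = LinearOperator(
        (m, n), dtype=float,
        matvec=lambda x: u * (K @ (v * np.ravel(x) / np.sqrt(b_l))) / np.sqrt(a_l),
        rmatvec=lambda y: v * (K.T @ (u * np.ravel(y) / np.sqrt(a_l))) / np.sqrt(b_l))
    theta = np.sort(svds(S, k=2, return_singular_vectors=False))[0]   # second largest
    return 2.0 / (1.0 + np.sqrt(1.0 - theta**2))
```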