Convergence of an asynchronous block-coordinate forward-backward algorithm for convex composite optimization

In this paper, we study the convergence properties of a randomized block-coordinate descent algorithm for the minimization of a composite convex objective function, where the block-coordinates are updated asynchronously and randomly according to an arbitrary probability distribution. We prove that the iterates generated by the algorithm form a stochastic quasi-Fejér sequence and thus converge almost surely to a minimizer of the objective function. Moreover, we prove a general sublinear rate of convergence in expectation for the function values, and a linear rate of convergence in expectation under an error bound condition of Tseng type. Under the same condition, strong convergence of the iterates is provided, as well as their linear convergence rate.

A4. $F$ attains its minimum $F^* := \min F$ on $\mathsf{H}$.
To solve problem (1.1), we use the following asynchronous block-coordinate descent algorithm. It is an extension of the parallel block-coordinate proximal gradient method considered in [42] to the asynchronous setting, where an inconsistently delayed gradient vector may be processed at each iteration.
Algorithm 1.1. Let $(i_k)_{k \in \mathbb{N}}$ be a sequence of i.i.d. random variables with values in $[m] := \{1, \dots, m\}$ and let $p_i$ be the probability of the event $\{i_k = i\}$, for every $i \in [m]$. Let $(\mathsf{d}^k)_{k \in \mathbb{N}}$ be a sequence of integer delay vectors, $\mathsf{d}^k = (\mathsf{d}^k_1, \dots, \mathsf{d}^k_m) \in \mathbb{N}^m$, such that $\max_{1 \le i \le m} \mathsf{d}^k_i \le \min\{k, \tau\}$ for some $\tau \in \mathbb{N}$. The delay vectors are deterministic and independent of the block-coordinate selection process $(i_k)_{k \in \mathbb{N}}$. Let $(\gamma_i)_{1 \le i \le m} \in \mathbb{R}^m_{++}$ and let $x^0 = (x^0_1, \dots, x^0_m) \in \mathsf{H}$ be a constant random variable. Iterate. In this work, we assume the stepsize rule $\gamma_i < 2/\bigl(L_i + 2\tau L_{\mathrm{res}}\, p_{\max}/\sqrt{p_{\min}}\bigr)$ for every $i \in [m]$, where $p_{\max} := \max_{1 \le i \le m} p_i$ and $p_{\min} := \min_{1 \le i \le m} p_i$. If there is no delay, namely $\tau = 0$, the usual stepsize rule $\gamma_i < 2/L_i$ is obtained [14, 43].
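Since the iteration display is not reproduced above, the following sketch gives one plausible reading of the update in NumPy: at step $k$ a block $i_k$ is drawn, a delayed iterate is assembled coordinate-wise from past iterates (inconsistent read), and a forward-backward step is applied to that block only. The function names (`grad_f`, `prox_g`) and the way delays are simulated are our illustrative assumptions, not the paper's code.

```python
import numpy as np

def async_block_fb(x0, grad_f, prox_g, gammas, probs, tau, n_iter, rng):
    """Hedged sketch of a delayed block-coordinate forward-backward step.

    grad_f(x, i)        : i-th partial gradient of the smooth term f
    prox_g(v, i, gamma) : proximity operator of gamma * g_i
    gammas, probs       : per-block stepsizes and selection probabilities
    tau                 : maximum delay
    """
    m = len(x0)
    history = [np.array(x0, dtype=float)]   # past iterates, to build delayed reads
    x = history[0].copy()
    for k in range(n_iter):
        i = rng.choice(m, p=probs)                    # random block selection
        # inconsistent read: every coordinate may come from a different past iterate
        d = rng.integers(0, min(k, tau) + 1, size=m)  # delays d_j <= min(k, tau)
        x_delayed = np.array([history[k - d[j]][j] for j in range(m)])
        # forward-backward step on block i only, with the delayed partial gradient
        x[i] = prox_g(x[i] - gammas[i] * grad_f(x_delayed, i), i, gammas[i])
        history.append(x.copy())
    return x
```

With $g = 0$ and a separable quadratic $f$, the sketch converges to the minimizer despite the delays, in line with Theorem 3.1.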
The presence of the delay vectors in the above algorithm makes it possible to describe a parallel computational model on multiple cores, as we explain below.

Asynchronous models
In this section we discuss an example of a parallel computational model, occurring in shared-memory system architectures, which is covered by the proposed algorithm. Consider a machine with multiple cores. They all have access to shared data $x = (x_1, \dots, x_m)$, and each core updates a block-coordinate $x_i$, $i \in [m]$, asynchronously, without waiting for the others. The iteration counter $k$ is increased any time a component of $x$ is updated. When a core is given a coordinate to update, it has to read from the shared memory and compute a partial gradient. While it performs these two operations, the data $x$ may have been updated by other cores. So, when the core updates its assigned coordinate at iteration $k$, the gradient may no longer be up to date. This phenomenon is modelled by using a delay vector $\mathsf{d}^k$ and evaluating the partial gradient at $x^{k - \mathsf{d}^k}$, as in Algorithm 1.1. Each component of the delay vector reflects how many times the corresponding coordinate of $x$ has been updated since the core read that coordinate from the shared memory. Note that different delays among the coordinates may arise, since the shared data may be updated during the reading phase, so that the partial gradient is ultimately computed at a point which may not be consistent with any past instance of the shared data. This situation is called inconsistent read [6] and, in practice, allows a reading phase without any lock. By contrast, in a consistent-read model [30, 39], a lock is put on the data during the reading phase, and the delay originates only while the partial gradient is computed. The delay is then the same for all the block-coordinates, so that the value read by any core is a past instance of the shared data. However, for our theoretical study it makes no difference whether we consider an inconsistent or a consistent reading setting, because in the end only the maximum delay matters. Other papers in the literature also consider the inconsistent read, see [31, 8, 16].
We remark that, in our setting, for all $k \in \mathbb{N}$, the delay vector $\mathsf{d}^k$ is considered to be a parameter that does not depend on the random variable $i_k$, similarly to the works [31, 30, 16, 23]. In this way, the stochastic nature of the sequence $(x^k)_{k \in \mathbb{N}}$ is not determined by the delay, but comes only from the stochastic selection of the block-coordinates. Some papers consider the case where the delay vector is a random variable that may depend on $i_k$ [45, 8], or that is unbounded [45, 23]. Those settings are natural extensions of our work that we leave for future study. Finally, a completely deterministic model, in both the block selection and the delays, is studied in [12].

Related work
The topic of parallel asynchronous algorithms is not a recent one. In 1969, Chazan and Miranker [9] presented an asynchronous method for solving linear equations. Later on, Bertsekas and Tsitsiklis [6] proposed an inconsistent-read model of asynchronous computation. Due to the availability of large amounts of data and the importance of large-scale optimization, in recent years we have witnessed a surge of interest in asynchronous algorithms. They have been studied and adapted for many optimization problems and methods, such as stochastic gradient descent [1, 39, 20, 40, 29], the randomized Kaczmarz algorithm [32], and stochastic coordinate descent [2, 30, 41, 51, 45].
In general, stochastic algorithms can be divided into two classes. The first one arises when the function $f$ is an expectation, i.e., $f(x) = \mathbb{E}[h(x; \xi)]$. At each iteration $k$, only a stochastic gradient $\nabla h(\cdot\,; \xi_k)$ is computed, based on the current sample $\xi_k$. In this setting, many asynchronous versions have been proposed, where delayed stochastic gradients are considered, see [36, 20, 3, 10, 28, 34]. The second class, which is the one we study, is that of randomized block-coordinate methods. Below we describe the related literature.
The work [31] studied a problem and a model of asynchronicity similar to ours, but the proposed algorithm AsySPCD requires that the random variables $(i_k)_{k \in \mathbb{N}}$ be uniformly distributed (i.e., $p_i = 1/m$) and that the stepsize be the same for all the block-coordinates. This latter assumption is an important limitation, since it does not exploit the possibility of adapting the stepsizes to the block-Lipschitz constants of the partial gradients, which would allow longer steps along the block-coordinates. A linear rate of convergence is also obtained by exploiting a quadratic growth condition, which is essentially equivalent to our error bound condition [18]. For a discussion of the limitations of [31] and the improvements we bring, see Remark 3.2, point (vi), and Section 6 on numerical experiments.
In the nonconvex case, [16] considers an asynchronous algorithm which may select the blocks either in an almost cyclic manner or randomly with uniform probability. In the latter case, it is proved that the cluster points of the sequence of iterates are almost surely stationary points of the objective function. However, the convergence of the whole sequence is not provided, nor is any rate of convergence given for the function values. Moreover, under the Kurdyka-Łojasiewicz (KL) condition [18, 7], linear convergence is also derived, but it is restricted to the deterministic case.
To conclude, we note that our results, when specialized to the case of zero delays, fully recover the ones given in [42].

Contributions
The main contributions of this work are summarized below: • We first prove the almost sure weak convergence of the iterates $(x^k)_{k \in \mathbb{N}}$, generated by Algorithm 1.1, to a random variable $x^*$ taking values in $\operatorname{argmin} F$. At the same time, we prove a sublinear rate of convergence of the function values in expectation, i.e., $\mathbb{E}[F(x^k)] - F^* = o(1/k)$. We also provide for the same quantity an explicit rate of $O(1/k)$, see Theorem 3.1.
• Under an error bound condition of Luo-Tseng type, in addition to the almost sure strong convergence of the iterates, we prove linear convergence in expectation of the function values and in mean of the iterates, see Theorem 4.2.
We improve the state of the art in several respects: we consider an arbitrary probability distribution for the selection of the blocks; the adopted stepsize rule improves over the existing ones, and coincides with the one in [16] in the special case of uniform selection of the blocks - in particular, it allows larger stepsizes when the number of blocks grows; the almost sure convergence of the iterates in the convex and stochastic setting is new and relies on a stochastic quasi-Fejérian analysis; linear convergence under an error bound condition is also new in the asynchronous stochastic scenario. The rest of the paper is organized as follows. In the next subsection we set up the basic notation. In Section 2 we recall a few facts and provide some preliminary results. The general convergence analysis is given in Section 3, where the main Theorem 3.1 is presented. Section 4 contains the convergence theory under an additional error bound condition, while applications are discussed in Section 5. The majority of the proofs are postponed to Appendices A and B.
$\|\cdot\|$ and $|\cdot|$ denote the norms associated with the scalar products of $\mathsf{H}$ and of each $\mathsf{H}_i$, respectively. We also consider the canonical embeddings: for all $i \in [m]$ and $\mathsf{x}_i \in \mathsf{H}_i$, $(0, \dots, 0, \mathsf{x}_i, 0, \dots, 0)$, with $\mathsf{x}_i$ in the $i$-th position. Random vectors and variables are defined on the underlying probability space $(\Omega, \mathcal{A}, \mathsf{P})$. The default font is used for random variables, while a sans serif font is used for their realizations or for deterministic variables. Let [...]. This operator defines an equivalent scalar product on $\mathsf{H}$ as follows: [...], where for all $i \in [m]$, $\gamma_i$ and $p_i$ are defined in Algorithm 1.1. We set $p_{\max} := \max_{1 \le i \le m} p_i$ and $p_{\min} := \min_{1 \le i \le m} p_i$. Let $\varphi \colon \mathsf{H} \to \left]-\infty, +\infty\right]$ be proper, convex, and lower semicontinuous. The domain of $\varphi$ is $\operatorname{dom} \varphi := \{\mathsf{x} \in \mathsf{H} : \varphi(\mathsf{x}) < +\infty\}$. We recall that the proximity operator of $\varphi$ is $\operatorname{prox}_\varphi(\mathsf{x}) = \operatorname{argmin}_{\mathsf{y} \in \mathsf{H}} \varphi(\mathsf{y}) + \frac{1}{2}\|\mathsf{y} - \mathsf{x}\|^2$. If the function $\varphi \colon \mathsf{H} \to \mathbb{R}$ is differentiable, then for all $\mathsf{u}, \mathsf{x} \in \mathsf{H}$ and any symmetric positive definite operator $A$, we have $\langle \nabla_A \varphi(\mathsf{x}), \mathsf{u} \rangle_A = \langle \nabla \varphi(\mathsf{x}), \mathsf{u} \rangle$, where $\nabla_A$ denotes the gradient operator in the norm $\|\cdot\|_A$. If $S \subset \mathsf{H}$ and $\mathsf{x} \in \mathsf{H}$, we set $\operatorname{dist}_A(\mathsf{x}, S) = \inf_{\mathsf{z} \in S} \|\mathsf{x} - \mathsf{z}\|_A$. We also denote by $\operatorname{prox}^A_\varphi$ the proximity operator of $\varphi$ in the norm $\|\cdot\|_A$.
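To illustrate the proximity operator just recalled, take $\varphi = \lambda|\cdot|$ on the real line: $\operatorname{prox}_\varphi$ is then the soft thresholding operator used later in Section 5.1. The brute-force check below is only a sanity illustration of the definition, not part of the paper's development.

```python
import numpy as np

def prox_abs(x, lam):
    # closed form: the prox of lam*|.| is soft thresholding with threshold lam
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def prox_numeric(x, lam):
    # brute force: argmin_y lam*|y| + 0.5*(y - x)^2 over a fine grid
    grid = np.linspace(-5.0, 5.0, 200001)
    vals = lam * np.abs(grid) + 0.5 * (grid - x) ** 2
    return grid[np.argmin(vals)]
```

Both computations agree up to the grid resolution, confirming the closed form.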

Preliminaries
In this section we present basic definitions and facts that are used in the rest of the paper. Most of them are already known, and we include them for clarity.
In the rest of the paper, we extend the definition of $x^k$ by setting $x^k = x^0$ for every $k \in \{-\tau, \dots, -1\}$. Using the notation of Algorithm 1.1, we also set, for any $k \in \mathbb{N}$, [...]. With this notation, we have [...]. We remark that the random variables $x^k$ and $\bar{x}^{k+1}$ depend on the previously selected blocks and the related delays. More precisely, we have [...]. From (2.1) and (2.2), we derive [...] and therefore, for every $\mathsf{x} \in \mathsf{H}$, [...]. Suppose that $\mathsf{x}$ and $\mathsf{x}'$ in $\mathsf{H}$ differ in only one component, say that of index $i$; then it follows from Assumption A3 and the Descent Lemma [37, Lemma 1.2.3] that (2.7) holds. We finally need the following results on the convergence of stochastic quasi-Fejér sequences and on monotone summable positive sequences. Fact 2.1 ([13], Proposition 2.3). Let $S$ be a nonempty closed subset of a real Hilbert space $\mathsf{H}$. Let $\mathscr{F} = (\mathcal{F}_n)_{n \in \mathbb{N}}$ be a sequence of sub-sigma-algebras of $\mathcal{F}$ such that $(\forall n \in \mathbb{N})\ \mathcal{F}_n \subset \mathcal{F}_{n+1}$. We denote by $\ell_+(\mathscr{F})$ the set of sequences of $\mathbb{R}_+$-valued random variables $(\xi_n)_{n \in \mathbb{N}}$ such that, for every $n \in \mathbb{N}$, $\xi_n$ is $\mathcal{F}_n$-measurable. Let $(x^n)_{n \in \mathbb{N}}$ be a sequence of $\mathsf{H}$-valued random variables. Suppose that, for every $\mathsf{z} \in S$, there exist [...] and $(\eta_n(\mathsf{z}))_{n \in \mathbb{N}} \in \ell^1_+(\mathscr{F})$ such that the stochastic quasi-Fejér property is satisfied $\mathsf{P}$-a.s. Then the following hold: [...] (ii) Suppose that the set of weak cluster points of the sequence $(x^n)_{n \in \mathbb{N}}$ is $\mathsf{P}$-a.s. contained in $S$.

Auxiliary lemmas
Here we collect the technical lemmas needed for our analysis, using the notation given in (2.1). For the reader's convenience, we provide all the proofs in Appendix A.
The following result appears in [31, page 357].
The next lemma bounds the difference between the delayed and the current gradient in terms of the steps along the block coordinates, see [31, equation A.7].
Lemma 2.6. Let $(x^k)_{k \in \mathbb{N}}$ be the sequence generated by Algorithm 1.1. It follows that [...]. We set [...]. The result below yields a kind of inexact convexity inequality, due to the presence of the delayed gradient vector. It is our variant of [31, Equation A.20].
Lemma 2.8. Let $(x^k)_{k \in \mathbb{N}}$ be a sequence generated by Algorithm 1.1. Then, for every $k \in \mathbb{N}$, [...]. The result below generalizes Lemma 4.3 in [42] to the asynchronous case.
Lemma 2.9. Let $\mathsf{H}$ be a real Hilbert space. Let $\varphi \colon \mathsf{H} \to \mathbb{R}$ be differentiable and convex, and let $\psi \colon \mathsf{H} \to \left]-\infty, +\infty\right]$ be proper, lower semicontinuous and convex. Let $\mathsf{x}, \bar{\mathsf{x}} \in \mathsf{H}$ and set [...]

Convergence analysis
In this section we assume only convexity of the objective function, and we provide a worst-case convergence rate as well as almost sure weak convergence of the iterates.
Throughout the section we set [...], where the constants $L_i$ and $L_{\mathrm{res}}$ are defined in Assumption A3 and the constant $L^V_{\mathrm{res}}$ is defined in Remark 2.7. The main convergence theorem is as follows.
Theorem 3.1. Let $(x^k)_{k \in \mathbb{N}}$ be the sequence generated by Algorithm 1.1 and suppose that $\delta < 2$. Then the following hold.
(i) The sequence $(x^k)_{k \in \mathbb{N}}$ weakly converges $\mathsf{P}$-a.s. to a random variable that takes values in $\operatorname{argmin} F$.
Remark 3.2. (i) Theorem 3.1 extends classical results on the forward-backward algorithm to the asynchronous and stochastic block-coordinate setting; see [43] and references therein. Moreover, we note that the above results, when specialized to the synchronous case, that is, $\tau = 0$, yield exactly [42, Theorem 4.9]. The $o(1/k)$ rate was also proven in [27].
(ii) The almost sure weak convergence of the iterates for the asynchronous stochastic forward-backward algorithm is new. In general, only convergence in value is provided or, in the nonconvex case, the cluster points of the sequence of iterates are proven to be almost surely stationary points [16, 8].
(iii) As can be readily seen from statement (ii) in Theorem 3.1, our results depend only on the maximum possible delay, and therefore apply in the same way to the consistent and inconsistent read models.
(iv) If we suppose that the random variables $(i_k)_{k \in \mathbb{N}}$ are uniformly distributed over $[m]$, the stepsize rule reduces to $\gamma_i < 2/(L_i + 2\tau L_{\mathrm{res}}/\sqrt{m})$, which agrees with the rule given in [16] and improves as the number of blocks $m$ increases. In this case, we see that the effect of the delay on the stepsize rule is mitigated by the number of blocks. In [8] the stepsize is not adapted to the blockwise Lipschitz constants $L_i$; instead, it is chosen for each block as $\gamma < 2/(2L_f + \tau^2 L_f)$ with $L_f \ge L_{\mathrm{res}}$, leading, in general, to smaller stepsizes. In addition, this rule has a worse dependence on the delay $\tau$ and lacks any dependence on the number of blocks.
(v) The framework of [8] is nonconvex and considers more general types of algorithms, in the flavour of majorization-minimization approaches [24]. On the other hand, the assumptions are stronger (in particular, $F$ is assumed to be coercive) and the rate of convergence is given with respect to a quantity which is hard to relate to $F(x^k) - F^*$. They also prove that the cluster points of the sequence of iterates are almost surely stationary points.
(vi) The work [31] was among the first to study an asynchronous version of the randomized coordinate gradient descent method. There, the coordinates were selected at random with uniform probability, and the stepsize was chosen to be the same for every coordinate. However, the stepsize was chosen to depend exponentially on $\tau$, i.e., as $O(1/\rho^\tau)$ with $\rho > 1$, which is much worse than our $O(1/\tau)$. The same problem affects the constant in front of the bound of the rate of convergence, which is of the form $O(\rho^\tau)$.
To circumvent the limitations above, they impose a condition in Corollary 4.2 that bounds how large the maximum delay $\tau$ can be: [condition (3.2)], where $m$ is the dimension of the space. However, this inequality is never satisfied if $\Lambda > \sqrt{m}/(4e)$, since this would imply [...], contradicting the fact that $\tau$ is a non-negative integer. An example where this happens is a quadratic function with positive semidefinite Hessian $Q \in \mathbb{R}^{n \times n}$. In this case, [...]. Say one column of $Q$ has constant entries equal to $p > 0$, while the absolute values of all the other entries of $Q$ are less than $p$. Then [...]. In Section 6, we show two experiments on real datasets for which condition (3.2) is not verified.
Before giving the proof of Theorem 3.1, we present a few preliminary results. The first one is a proposition showing that the function values are decreasing in expectation. The proof of this proposition, as well as those of the next intermediate results, are given in Appendix B. Proposition 3.3. Assume that $\delta < 2$ and let $(x^k)_{k \in \mathbb{N}}$ be the sequence generated by Algorithm 1.1. Then, for every $k \in \mathbb{N}$, [...] where [...]. Lemma 3.4. Let $(x^k)_{k \in \mathbb{N}}$ be the sequence generated by Algorithm 1.1. Then, for every $k \in \mathbb{N}$, we have [...] where $\alpha_k$ is defined in Proposition 3.3.
Proposition 3.6. Let $(x^k)_{k \in \mathbb{N}}$ be a sequence generated by Algorithm 1.1 and suppose that $\delta < 2$. Let $(\bar{x}^k)_{k \in \mathbb{N}}$ and $(\alpha_k)_{k \in \mathbb{N}}$ be defined as in (2.1) and in Proposition 3.3, respectively. Then, for every $k \in \mathbb{N}$, [...]. Next we state a proposition that we will use throughout the rest of this paper. It corresponds to [42, Proposition 4.6]. Proposition 3.7. Let $(x^k)_{k \in \mathbb{N}}$ be a sequence generated by Algorithm 1.1 and suppose that $\delta < 2$. Let $(\alpha_k)_{k \in \mathbb{N}}$ be defined as in Proposition 3.3. Then, for every $k \in \mathbb{N}$, [...]. In the following, we show a general inequality from which we derive simultaneously the convergence of the iterates and the rate of convergence in expectation of the function values. Proposition 3.8. Let $(x^k)_{k \in \mathbb{N}}$ be a sequence generated by Algorithm 1.1 and suppose that $\delta < 2$. Let $(\alpha_k)_{k \in \mathbb{N}}$ be defined as in Proposition 3.3. Then, for all $\mathsf{x} \in \mathsf{H}$, [...]. Proposition 3.9. Let $(x^k)_{k \in \mathbb{N}}$ be a sequence generated by Algorithm 1.1 and suppose that $\delta < 2$. Let $(\bar{x}^k)_{k \in \mathbb{N}}$ be defined as in (2.1). Then there exists a sequence of $\mathsf{H}$-valued random variables $(v^k)_{k \in \mathbb{N}}$ such that the following assertions hold: [...] (ii) $v^k \to 0$ and $x^k - \bar{x}^{k+1} \to 0$ $\mathsf{P}$-a.s., as $k \to +\infty$.
We are now ready to prove the main theorem.
Proof of Theorem 3.1. (i): It follows from Proposition 3.8 that [...], where $(\xi_k)_{k \in \mathbb{N}}$ is a sequence of positive random variables which is $\mathsf{P}$-a.s. summable. Thus, the sequence $(x^k)_{k \in \mathbb{N}}$ is stochastic quasi-Fejér with respect to $\operatorname{argmin} F$ in the norm $\|\cdot\|_W$ (which is equivalent to $\|\cdot\|$). Then, according to Fact 2.1, it is bounded $\mathsf{P}$-a.s. We now prove that $\operatorname{argmin} F$ contains the weak cluster points of $(x^k)_{k \in \mathbb{N}}$ $\mathsf{P}$-a.s. Indeed, let $\Omega_1 \subset \Omega$ with $\mathsf{P}(\Omega \setminus \Omega_1) = 0$ be such that items (i) and (ii) of Proposition 3.9 hold. Let $\omega \in \Omega_1$ and let $\mathsf{x}$ be a weak cluster point of $(x^k(\omega))_{k \in \mathbb{N}}$. There exists a subsequence $(x^{k_q}(\omega))_{q \in \mathbb{N}}$ which weakly converges to $\mathsf{x}$. By Proposition 3.9, we have $\bar{x}^{k_q+1}(\omega) \rightharpoonup \mathsf{x}$, $v^{k_q+1}(\omega) \to 0$, and $v^{k_q+1}(\omega) \in \partial(f+g)(\bar{x}^{k_q+1}(\omega))$. Thus, [35, Proposition 1.6] (demiclosedness of the graph of the subgradient) yields $0 \in \partial F(\mathsf{x})$, and hence $\mathsf{x} \in \operatorname{argmin} F$. Therefore, again by Fact 2.1, we conclude that the sequence $(x^k)_{k \in \mathbb{N}}$ weakly converges $\mathsf{P}$-a.s. to a random variable that takes values in $\operatorname{argmin} F$.
(ii): Choose $\mathsf{x} \in \operatorname{argmin} F$ in Proposition 3.8 and take the expectation. Then we get [...]

Linear convergence under error bound condition
In the previous section we obtained a sublinear rate of convergence. Here we show that, with an additional assumption, a better convergence rate can be obtained. We also derive strong convergence of the iterates, improving the weak convergence proved in Theorem 3.1. We will assume that the following Luo-Tseng error bound condition [33] holds on a subset $X \subset \mathsf{H}$ (containing the iterates $x^k$).
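The display of condition (4.1) does not survive above; in the standard form of [33], and up to the weighted norms used in this paper, a Luo-Tseng error bound on $X$ reads as follows (a hedged restatement, not the paper's exact display):

```latex
\operatorname{dist}\bigl(\mathsf{x}, \operatorname{argmin} F\bigr)
  \;\le\; C_X \,\bigl\| \operatorname{prox}_{g}\bigl(\mathsf{x} - \nabla f(\mathsf{x})\bigr) - \mathsf{x} \bigr\|
  \qquad \text{for all } \mathsf{x} \in X,
```

i.e., the distance to the solution set is controlled by the norm of the proximal-gradient residual.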
Remark 4.1. We recall that the condition above is equivalent to the Kurdyka-Łojasiewicz property and to the quadratic growth condition [18, 7, 42]. Any of these conditions can be used to prove linear convergence rates for various algorithms.
The following theorem is the main result of this section. Here, linear convergence of the function values and strong convergence of the iterates are ensured. Theorem 4.2. Let $(x^k)_{k \in \mathbb{N}}$ be generated by Algorithm 1.1, and suppose that $\delta < 2$ and that the error bound condition (4.1) holds with $X \supset \{x^k \mid k \in \mathbb{N}\}$ $\mathsf{P}$-a.s. for some $C_{X,\Gamma^{-1}} > 0$. Then for all $k \in \mathbb{N}$, [...] where [...]. (ii) The sequence $(x^k)_{k \in \mathbb{N}}$ converges strongly $\mathsf{P}$-a.s. to a random variable $x^*$ that takes values in $\operatorname{argmin} F$. Proof. (i): From Proposition 3.6 we have [...]. Now, taking $\mathsf{x} \in \operatorname{argmin} F$ and using the error bound condition (4.1) and equation (3.3), we obtain [...]. Adding and subtracting $F^*$ in both expectations yields [...], where in the last equality we used Lemma 3.5. From (3.3), we have, for $k$ such that $k - \tau \ge 0$, [...]. Because the sequence $(\mathbb{E}[F(x^k) + \alpha_k])_{k \in \mathbb{N}}$ is decreasing, the transition from the second line to the third one is allowed. Using (4.4) and (4.5) in (4.3) with total expectation, we obtain [...], so (4.7) remains true. Also, from (B.10), we have [...]. (ii): From the Jensen inequality, (3.3) and (4.7), we have [...]. Therefore $\sum_{k \in \mathbb{N}} \|x^{k+1} - x^k\|_{\Gamma^{-1}} < \infty$ $\mathsf{P}$-a.s., which implies that the sequence $(x^k)_{k \in \mathbb{N}}$ is a Cauchy sequence $\mathsf{P}$-a.s. By Theorem 3.1 (i), this sequence has cluster points that take values in $\operatorname{argmin} F$, so it converges strongly $\mathsf{P}$-a.s. to a random variable that takes values in $\operatorname{argmin} F$. Now let $\rho = 1 - p_{\min}/(\kappa + \theta)$. For all $n \in \mathbb{N}$, [...]. Letting $n \to \infty$ and using (4.8), we get the claim. Remark 4.3.
(i) A linear convergence rate is also given in [31, Theorem 4.1], assuming a quadratic growth condition instead of the error bound condition (4.1). Their rate depends on the stepsize, which in general can be very small, as explained earlier in point (vi) of Remark 3.2.
(ii) The error bound condition (4.1) is sometimes satisfied globally, meaning on $X = \operatorname{dom} F$, so that the condition $X \supset \{x^k \mid k \in \mathbb{N}\}$ $\mathsf{P}$-a.s. required in Theorem 4.2 is clearly fulfilled. This is the case when $F$ is strongly convex, or when $f$ is quadratic and $g$ is the indicator function of a polytope (see Remark 4.17(iv) in [42]). More often, for general convex objectives, the error bound condition (4.1) is satisfied on sublevel sets of $F$ (see [42, Remark 4.18]). Therefore, it is important to find conditions ensuring that the sequence $(x^k)_{k \in \mathbb{N}}$ remains in a sublevel set.
The next results address this issue.
We first give an analogue of Lemma 3.4.
Lemma 4.4. Let $(x^k)_{k \in \mathbb{N}}$ be the sequence generated by Algorithm 1.1. Then, for every $k \in \mathbb{N}$, [...]. Proof. Let $k \in \mathbb{N}$. From the Cauchy-Schwarz inequality, the Young inequality and Lemma 2.5, we have [...]. Using the same decomposition of the last term as in Lemma 3.4, we get [...]. By minimizing $s \mapsto s/2 + \tau^2 L_{\mathrm{res}}^2/(2s)$ over $s > 0$, we find $s = \tau L_{\mathrm{res}}$. We then obtain [...] and the statement follows.
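The minimization step at the end of the proof is elementary calculus; for completeness (our own filler, consistent with the quantities appearing in the proof):

```latex
\frac{d}{ds}\Bigl(\frac{s}{2} + \frac{\tau^2 L_{\mathrm{res}}^2}{2s}\Bigr)
  = \frac{1}{2} - \frac{\tau^2 L_{\mathrm{res}}^2}{2 s^2} = 0
  \iff s^2 = \tau^2 L_{\mathrm{res}}^2
  \iff s = \tau L_{\mathrm{res}} \quad (s > 0),
```

and the corresponding minimum value is $\tau L_{\mathrm{res}}/2 + \tau L_{\mathrm{res}}/2 = \tau L_{\mathrm{res}}$.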
Proposition 4.5. Let $(x^k)_{k \in \mathbb{N}}$ be the sequence generated by Algorithm 1.1. Then, for every $k \in \mathbb{N}$, [...]. Proof. Using Lemma 4.4 in equation (B.3), we have [...], so the statement follows.
Corollary 4.6. Let $(x^k)_{k \in \mathbb{N}}$ be generated by Algorithm 1.1 with the $\gamma_i$'s satisfying the stepsize rule $\gamma_i < 2/(L_i + 2\tau L_{\mathrm{res}})$ for every $i \in [m]$.

So if the error bound condition (4.1) holds on the sublevel set $X = \{F \le F(x^0)\}$, then the assumptions of Theorem 4.2 are met.
Proof. The left-hand side in (4.9) is positive, and hence $(F(x^k) + \bar{\alpha}_k)_{k \in \mathbb{N}}$ is decreasing $\mathsf{P}$-a.s. Therefore, we have, for every $k \in \mathbb{N}$, [...]. Remark 4.7. The rule (4.10) yields stepsizes possibly smaller than the ones given in Theorem 3.1, which requires $\gamma_i < 2/(L_i + 2\tau L_{\mathrm{res}}\, p_{\max}/\sqrt{p_{\min}})$. Indeed, this happens when $p_{\max}/\sqrt{p_{\min}} < 1$. For instance, if the distribution is uniform, we have $p_{\max}/\sqrt{p_{\min}} = 1/\sqrt{m} < 1$ whenever $m \ge 2$. On the other hand, there may exist distributions for which $p_{\max}/\sqrt{p_{\min}} > 1$.

Applications
Here we present two problems where Algorithm 1.1 can be useful.

The Lasso problem
We start with the Lasso problem [47], also known as basis pursuit [11]. It is a least-squares regression problem with an $\ell_1$ regularizer, which favors sparse solutions. More precisely, given $A \in \mathbb{R}^{n \times m}$ and $b \in \mathbb{R}^n$, one aims at solving the problem $\min_{x \in \mathbb{R}^m} \frac{1}{2}\|Ax - b\|_2^2 + \lambda\|x\|_1$. We clearly fall within the framework of problem (1.1) with $f(x) = \frac{1}{2}\|Ax - b\|_2^2$ and $g_i(x_i) = \lambda|x_i|$. The assumptions A1, A2, A3 and A4 are also satisfied. In particular, here $L_i = \|a_i\|^2$, where $a_i$ is the $i$-th column of $A$, $L_{\mathrm{res}} = \max_i \|A^\top A_{\bullet i}\|_2$, with $A^\top A_{\bullet i}$ the $i$-th column of $A^\top A$, and $F = f + g$ attains its minimum.
The Lasso technique is used in many fields, especially for high-dimensional problems - among others, it is worth mentioning statistics, signal processing, and inverse problems; see [4, 48, 25, 5, 17, 46] and references therein. Since there is no closed-form solution for this problem, many iterative algorithms have been proposed to solve it: forward-backward, accelerated (proximal) gradient descent, (proximal) block-coordinate descent, etc. [15, 4, 38, 22, 49, 21]. In the same vein, applying Algorithm 1.1 to the Lasso problem (5.1) yields the iterative scheme [...], where, for every $\rho > 0$, $\operatorname{soft}_\rho \colon \mathbb{R} \to \mathbb{R}$ is the soft thresholding operator with threshold $\rho$ [43]. Thanks to Theorem 3.1, we know that the iterates $(x^k)_{k \in \mathbb{N}}$ are weakly convergent and the function values have a convergence rate of $o(1/k)$. On top of that, the cost function of the Lasso problem (5.1) satisfies the error bound condition (4.1) on its sublevel sets [50, Theorem 2]. So, following Corollary 4.6 and Theorem 4.2, the iterates converge strongly (a.s.) and linearly in mean, whenever $\gamma_i < 2/(L_i + 2\tau L_{\mathrm{res}})$ for all $i \in [m]$.
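A compact simulation of this scheme (uniform block selection, delays modeled as in Section 6, the constants of Section 5.1, and the stepsize rule of Corollary 4.6) might look as follows; the delay model and helper names are our assumptions, not the paper's released code.

```python
import numpy as np

def soft(x, rho):
    # soft thresholding: the prox of rho*|.|
    return np.sign(x) * np.maximum(np.abs(x) - rho, 0.0)

def async_lasso(A, b, lam, tau, n_iter, rng):
    n, m = A.shape
    L = (A ** 2).sum(axis=0)                      # L_i = ||a_i||^2
    L_res = max(np.linalg.norm(A.T @ A[:, i]) for i in range(m))
    gammas = 1.9 / (L + 2.0 * tau * L_res)        # gamma_i < 2/(L_i + 2*tau*L_res)
    x = np.zeros(m)
    history = [x.copy()]
    for k in range(n_iter):
        i = rng.integers(m)                       # uniform block selection
        d = rng.integers(0, min(k, tau) + 1, size=m)   # simulated delays
        xd = np.array([history[k - d[j]][j] for j in range(m)])
        grad_i = A[:, i] @ (A @ xd - b)           # delayed partial gradient
        x[i] = soft(x[i] - gammas[i] * grad_i, gammas[i] * lam)
        history.append(x.copy())
    return x
```

On a small random instance, the delayed iterates reach essentially the same objective value as a long run of the synchronous proximal gradient method.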

Linear convergence of dual proximal gradient method
We consider the problem [...] where, for all $i \in [m]$, $A_i \colon \mathsf{H} \to \mathsf{G}_i$ is a linear operator between Hilbert spaces, $\varphi_i \colon \mathsf{G}_i \to \left]-\infty, +\infty\right]$ is proper, convex and lower semicontinuous, and $h \colon \mathsf{H} \to \left]-\infty, +\infty\right]$ is proper, lower semicontinuous and $\sigma$-strongly convex ($\sigma > 0$). The first term of the objective function may represent the empirical data loss and the second term the regularizer. This problem arises in many applications in machine learning, signal processing and statistical estimation, and is commonly called regularized empirical risk minimization [44]. It includes, for instance, ridge regression and (soft margin) support vector machines [44] and, more generally, Tikhonov regularization [26, Section 5.3].
In the following we apply Algorithm 1.1 to the dual of problem (5.3). Below we provide the details. Set $\mathsf{G} = \bigoplus_{i=1}^m \mathsf{G}_i$ and $u = (u_1, u_2, \dots, u_m)$. Then, the dual of problem (5.3) is $\min_{u \in \mathsf{G}}$ [...], where $A_i^*$ is the adjoint operator of $A_i$, and $h^*$ and $\varphi_i^*$ are the Fenchel conjugates of $h$ and $\varphi_i$, respectively. The link between the dual variable $u$ and the primal variable $x$ is given by the rule $u \mapsto \nabla h^*(-\sum_{i=1}^m A_i^* u_i)$. Since $h^*$ is $(1/\sigma)$-Lipschitz smooth, the dual problem above is of the form of problem (1.1). Thus, Algorithm 1.1 applied to the dual problem (5.4) gives [...]. Suppose that $\nabla h^* = B$ is a linear operator and that the delay vector [...]. Then, using the primal variable, the KKT conditions, and the fact that $u^{k+1}$ and $u^k$ differ only in the $i_k$-th component, the algorithm becomes (5.6). The above algorithm requires a lock during the update of the primal variable $x$. On the contrary, the update of the dual variable $u$ is completely asynchronous, without any lock, as in the setting studied in this paper. To get a better understanding of this aspect, we present a concrete example: ridge regression.

Example: Ridge regression
Ridge regression is the following regularized least-squares problem: (5.7), where $K = XX^*$ and $X \colon \mathsf{H} \to \mathbb{R}^m$, with $Xw = (\langle w, x_i \rangle)_{1 \le i \le m}$. We remark that, in this situation, [...]. With $w^k = X^* u^k$, and considering that the nonsmooth part $g$ is null, the algorithm is given by (5.8). Remark 5.1. We now compare the above dual asynchronous algorithm with asynchronous stochastic gradient descent (ASGD) [39, 1]. We note that (5.8) yields [...]. Instead, applying asynchronous SGD to the primal problem (5.7) multiplied by $\lambda m$, we get [...]. We see that the only difference is the second term inside the parentheses in the two updates [...]. However, a major difference between the two approaches lies in the way the stepsize is set. Indeed, in ASGD, the stepsize $\gamma_k$ is chosen with respect to the operator norm of $K + \lambda m \,\mathrm{Id}$, i.e., the Lipschitz constant of the full gradient of the primal objective function, see [1, Theorem 1]. By contrast, in algorithm (5.8), for all $i \in [m]$, the stepsizes $\gamma_i^k$ are chosen with respect to the Lipschitz constants of the partial derivatives of the dual objective function, i.e., $K_{i,i} + \lambda m$. Not only are the latter easier to compute, they also allow for possibly longer steps along the coordinates.
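The displays (5.7) and (5.8) do not survive above, so the sketch below fixes one concrete (assumed) convention for the ridge primal, $\min_w \frac{1}{2}\|Xw - y\|^2 + \frac{\lambda m}{2}\|w\|^2$, derives its dual as in (5.4), and runs the randomized dual coordinate method without delays ($\tau = 0$). The blockwise curvatures $K_{i,i} + \lambda m$ from Remark 5.1 appear (up to a $\lambda m$ scaling) in the stepsizes; all names and scalings are our assumptions.

```python
import numpy as np

def dual_ridge_cd(X, y, lam, n_iter, rng):
    """Randomized dual coordinate descent for ridge regression (hedged sketch).

    Assumed primal: min_w 0.5*||X w - y||^2 + (lam*m/2)*||w||^2.
    Dual objective: 0.5*||u||^2 + <y, u> + ||X.T u||^2 / (2*lam*m).
    Primal-dual link: w = -X.T u / (lam*m).
    """
    m = X.shape[0]
    K = X @ X.T                      # Gram matrix K = X X^T
    u = np.zeros(m)
    w = np.zeros(X.shape[1])         # maintains w = -X^T u / (lam*m)
    for _ in range(n_iter):
        i = rng.integers(m)
        # partial derivative of the dual objective w.r.t. u_i
        g = u[i] + y[i] + (K[i] @ u) / (lam * m)
        # exact coordinate step: curvature along u_i is 1 + K_ii/(lam*m)
        du = -g / (1.0 + K[i, i] / (lam * m))
        u[i] += du
        w -= X[i] * du / (lam * m)   # the (lock-protected) primal update
    return w, u
```

Under these conventions the method recovers the closed-form ridge solution $w^* = (X^\top X + \lambda m I)^{-1} X^\top y$.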

Experiments
In this section, we present some experiments with the purpose of assessing our theoretical findings and making comparisons with related results in the literature. All the code is available on GitHub 1.
We coded the mathematical model of asynchronicity in (1.2). At each iteration we compute the forward step using gradients that are possibly outdated. The components of the delay vector are chosen a priori according to a uniform distribution on $\{0, 1, \dots, \tau\}$. The block-coordinates are updated with a uniform distribution, independent of the delay vector. We considered three kinds of experiments. In the first one we ran a speedup test for our algorithm on the Lasso problem, which allows us to check whether the speed of convergence increases linearly with the number of machines used. Then, we compared with the synchronous version of the algorithm, in order to show the advantage of the asynchronous implementation. Finally, in the third group of experiments we compared our algorithm with those by Liu et al. [31] and Cannelli et al. [8].

Speedup test
In this section we consider the Lasso problem (5.1) with $n = 100$ and $m \in \{500, 1000, 2000, 8000\}$. The parameter $\lambda$ is chosen small enough so that the minimizer $x^*$ has some nonzero components. For more flexibility, we used synthetic data, generated with the function MAKE_CORRELATED_DATA of the python library CELER. This function creates a matrix $A$ with columns generated according to the autoregressive (AR) model 2. Then $b$ is generated as $b = Aw + \epsilon$, where $\epsilon$ is a Gaussian random vector with zero mean and covariance equal to the identity, such that the signal-to-noise ratio (SNR) is 3, and $w$ is a vector with 1% of nonzero entries. The nonzero blocks of $w$ are chosen uniformly and their entries are generated according to the standard normal distribution. As in [31, 8], we make the assumption that $\tau$ is proportional to the number of machines. Since we use 10 cores, we fix $\tau = 10$ as in [28]. For fixed data, we run the algorithm 10 times and average the results. Similarly to [31, 8], in our experiments the speedup gets better when we increase the number of blocks, see Figure 1. This can be explained by the fact that the algorithm has to run long enough to amortize the cost of parallelization (the initialization cost, the mandatory locks needed to avoid data races, etc.). Also, if there are more blocks, the probability of two machines having to write to the same block at the same time is reduced, and so is the number of locks. All these observations align with the known fact that the more cores there are, the more complex the problem should be in order to see a good speedup.

Comparison with the synchronous version
We compared Algorithm 1.1 to its synchronous counterpart on the Lasso problem. The data, as well as the parameters, are generated as in the speedup experiment. The stepsize of the synchronous algorithm is set as suggested in [42] for a non-sparse matrix $A$. We ran both algorithms for 120 seconds and compared the distances of their function values to the minimum. As expected, Algorithm 1.1 is faster; see Figure 2.

Comparison with other asynchronous algorithms
In this section we illustrate the results of the comparison with the algorithms proposed in [31] and [8]. As for [8], we set (in the notation of that paper) the relaxation parameter $\gamma = 1$ and $c_f = 2\beta$, so that [...]. Then, according to Theorem 1 in [8], we choose $2\beta > L_f(1 + \delta^2/2)$, where $\delta = \tau$ is the maximum delay. We note that this model is slightly different from ours, since the delay is present not only in the gradient.
In [31], the same algorithm as (1.2) is considered, but with a stepsize γ that is the same for all the blocks. In our comparisons, we choose the stepsize according to the conditions required by the main Theorem 4.1 in [31], since the hypotheses of Corollary 4.2 are not satisfied for our datasets; see the discussion in Remark 3.2 (vi). If τ is the maximum delay, Theorem 4.1 in [31] requires conditions on the stepsize which only make sense if the right-hand side is strictly positive, that is, when n > 16 and ρ > (1 + 4/√n)/(1 − 16/n) (instead of ρ > 1 + 4/√n as claimed in [31]). So, in the experiments, we set ρ > (1 + 4/√n)/(1 − 16/n). This leads in general to very small stepsizes, as we further discuss in the next section.
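The corrected admissibility condition on ρ can be checked numerically; the following helper (an illustration, not code from [31]) computes the corrected lower bound and shows it is strictly larger than the bound 1 + 4/√n claimed in [31].

```python
import math

def rho_lower_bound(n):
    """Corrected lower bound on rho from the discussion of Theorem 4.1 in [31]:
    (1 + 4/sqrt(n)) / (1 - 16/n), which is only meaningful for n > 16."""
    if n <= 16:
        raise ValueError("the condition requires n > 16")
    return (1.0 + 4.0 / math.sqrt(n)) / (1.0 - 16.0 / n)
```

For n = 100 this gives 1.4/0.84 ≈ 1.667, versus 1.4 for the uncorrected bound, illustrating why the resulting stepsizes are smaller than claimed.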

Lasso problem
In this section we consider the Lasso problem (5.1) with m = 90, n = 51630, and λ = 0.01. We use the dataset YearPredictionMSD.t from LIBSVM to generate the matrix A. Before showing the results, we briefly comment on the experimental set-up. As shown in Section 5.1, in this case L_res = max_i ‖AᵀA_{•i}‖_2. In [8], L_f = L_res, and an analogous constant is used in [31]. Looking at the results, we see that our algorithm outperforms those in [31] and [8]; see Figure 3. This difference is due to the fact that our stepsize is bigger than the other two. Indeed, in [31] and [8] the stepsizes have a worse dependence on the maximum delay τ (inverse quadratic in [8] and exponential in [31]), which ultimately shortens the stepsizes. Moreover, in both [31] and [8] the stepsize is the same for all the blocks, so the algorithm is more sensitive to the conditioning of the problem. An overall comparison of the effect of τ on the stepsize is shown in Figure 4.

Figure 4: This figure shows how the minimum of our stepsizes fares against the other two as τ increases, on a Lasso problem.
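The stepsize constants mentioned above can be computed directly from the data matrix. The sketch below (block layout and names are illustrative) evaluates the per-block Lipschitz constants L_i = ‖A_{•i}‖² and L_res = max_i ‖AᵀA_{•i}‖_2 for the smooth part f(x) = 0.5‖Ax − b‖², the scaling being an assumption.

```python
import numpy as np

def lasso_block_constants(A, blocks):
    """Per-block Lipschitz constants L_i (squared spectral norm of the i-th
    block of columns) and L_res = max_i ||A^T A_{.i}||_2, as used for the
    stepsizes in the Lasso experiment."""
    L = np.array([np.linalg.norm(A[:, idx], 2) ** 2 for idx in blocks])
    L_res = max(np.linalg.norm(A.T @ A[:, idx], 2) for idx in blocks)
    return L, L_res
```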

Logistic regression
For another comparison, we next consider the ℓ1-regularized logistic loss. For this experiment we use the dataset splice.t from LIBSVM with m = 60, n = 2175, and λ = 0.01. Let A ∈ R^{m×n} be the matrix whose columns are the a_i's (i ∈ [n]). We denote by ‖·‖, ‖·‖_∞, and ‖·‖_F the spectral norm, the infinity norm, and the Frobenius norm of matrices, respectively. The relevant constants for the stepsizes are:
• L_res = (1/n) ‖A‖ max_j ‖A_{j•}‖_2 for our algorithm and [31];
• the per-block constants L_i for our algorithm,
where A_{j•} is the j-th row of A.
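The constant L_res quoted above can be evaluated as follows; this is an illustrative helper, with the convention (assumed from the text) that A ∈ R^{m×n} has the samples a_i as columns.

```python
import numpy as np

def logistic_L_res(A):
    """L_res = (1/n) * ||A|| * max_j ||A_{j.}||_2 for the l1-regularized
    logistic loss experiment, where ||A|| is the spectral norm and A_{j.}
    is the j-th row of A."""
    m, n = A.shape
    spectral = np.linalg.norm(A, 2)                        # spectral norm ||A||
    row_max = max(np.linalg.norm(A[j, :]) for j in range(m))  # max_j ||A_{j.}||_2
    return spectral * row_max / n
```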

Appendices

A Proofs of the auxiliary Lemmas in Section 2
In this section, for the reader's convenience, we provide detailed proofs of the Lemmas presented in Section 2, even though they are mostly not original: they are adapted from, or can be found in, e.g., [31, 42].
Proof of Lemma 2.5. Let k ∈ N. Since, for every i ∈ [m], d_i^k ≤ min{k, τ}, we have the stated bound, where at most one summand is different from zero, because x^h and x^{h+1} differ only in the i_h-th component. The result follows.

Proof of Lemma 2.6. Let k ∈ N, let p = card(J(k)), and let (h_j)_{1≤j≤p} be the elements of J(k) ordered in strictly increasing order. Then, from Lemma 2.5, we have the corresponding decomposition. Let us set, for each t ∈ {0, . . ., p}, the interpolating points x̂^{k,t}. It then follows that x̂^{k,0} = x̂^k, x̂^{k,p} = x^k, and, for all t ≥ 1, x̂^{k,t} − x̂^{k,t−1} = x^{h_t+1} − x^{h_t}. Therefore x̂^{k,t} and x̂^{k,t−1} differ only in the value of one component, from which the result follows.
Proof of Lemma 2.8. Let k ∈ N and x ∈ H. Thanks to the convexity of f and (2.7), it follows that the first estimate holds. Using the expansion of the square of a sum, Hölder's inequality, and L_max ≤ L_res, we finally get the desired bound. The statement follows.

B Proofs of Section 3
Proof of Lemma 3.4. Let k ∈ N. From the Cauchy–Schwarz inequality, the Young inequality, and Remark 2.7, we obtain the first estimate. Now, decomposing the last term by Fact 2.4, we obtain the next bound. Recalling the definitions involved, we get the two stated inequalities.

Proof of Lemma 3.5. We have the stated identity; thus, taking the conditional expectation, the claim follows.

Proof of Proposition 3.3. Let k ∈ N. From the descent lemma along the i_k-th block-coordinate, we obtain the first inequality. From (2.5), we can bound the inner-product term. Taking the conditional expectation and using Fact 2.2, it follows that the corresponding estimate holds. We then plug this result into (B.4). Since δ < 2 and recalling (3.1), we have, for all i ∈ [m], the required bound on the coefficients. Therefore the statement follows.
Proof of Proposition 3.6. Let k ∈ N and x ∈ H. Using the expression of ∇f, we derive from Lemma 2.9, written in the weighted norm, the first inequality. From Lemma 2.8, we have the corresponding bound, so (B.5) becomes the next estimate. Next, recalling that x^k and x^{k+1} differ only in the i_k-th component, we obtain the following inequality, where in the last step we used the bound derived from (2.5). Now, by Lemma 3.4 and the block-coordinate descent lemma (2.6), we have the resulting estimate, where α_k = (L_res^V /(2√p_max)) Σ_{h=k−τ}^{k−1} (h − (k − τ) + 1) ‖x^{h+1} − x^h‖²_V for all k ∈ N. Therefore g(x^k) − g(x^{k+1}) + ⟨∇f(x^k), · ⟩ can be bounded accordingly. Since γ_i L_i + 2γ_i τ L_res^V √p_max ≤ δ < 2, (B.7) yields the final bound on g(x^k) − g(x^{k+1}) + ⟨∇f(x^k), · ⟩. The statement follows from (B.6).
Proof of Proposition 3.7. We know the stated identity. We derive from Proposition 3.6, multiplied by 2, the corresponding estimate. The statement follows.
Proof of Proposition 3.9. It follows from (3.3) that the stated inequality holds. This means that (E[F(x^k) + α_k])_{k∈N} is a nonincreasing sequence. Define, for all i ∈ [m], the corresponding quantities. Then, thanks to the second equation in (2.4), we obtain the claim.

Figure 1: The plots show the speedup obtained by Algorithm 1.1 compared to the ideal speedup, for different numbers of blocks. The shaded zones illustrate the standard deviation of the results over 10 trials.

Figure 2: Comparison of Algorithm 1.1 to its synchronous counterpart.

Figure 3: The plots show the behavior of F(x^k) − F* for the three algorithms applied to a Lasso problem, for different values of τ: 5, 10, 15, 20.

Figure 5: The plots show the behavior of F(x^k) − F* for the three algorithms applied to an ℓ1-regularized logistic loss, for different values of τ: 5, 10, 15, 20.