Linear Convergence Rate Analysis of a Class of Exact First-Order Distributed Methods for Weight-Balanced Time-Varying Networks and Uncoordinated Step Sizes

We analyze a class of exact distributed first-order methods under a general setting on the underlying network and the step sizes. In more detail, we simultaneously allow for time-varying uncoordinated step sizes and time-varying directed weight-balanced networks, jointly connected over bounded intervals. The analyzed class of methods subsumes several existing algorithms, such as the unified EXTRA and the unified DIGing (Jakovetic, 2019) and the exact spectral gradient method (Jakovetic, Krejic, Krklec Jerinkic, 2019), that have previously been analyzed under more restrictive assumptions. Under the assumed setting, we establish R-linear convergence of the methods and present several implications that our results have on the literature. Most notably, we show that the unification strategy in (Jakovetic, 2019) and the spectral step-size selection strategy in (Jakovetic, Krejic, Krklec Jerinkic, 2019) exhibit a high degree of robustness to uncoordinated time-varying step sizes and to time-varying networks.


Introduction
We consider a set of n computational agents and the following unconstrained optimization problem

min_{y ∈ R^d} f(y) = Σ_{j=1}^n f_j(y),    (1)

where each f_j is a real-valued function on R^d held privately by one of the agents, and the agents can communicate according to a given network. Problems of this form arise in many practical applications such as sensor networks [15], distributed control [17], distributed learning [4] and many others.
Several distributed methods [21,26,35,18,28,13,10,11,12] have been proposed in the literature for the solution of (1) that achieve exact convergence to the minimizer with a fixed step size, when the objective function is convex and has Lipschitz continuous gradient.
In [18,21] and [26] two exact gradient-based methods were proposed, and their convergence was proved for the case where the underlying network is undirected, connected, and remains constant through the entire execution of the algorithm. In [13] a unified analysis of a class of first-order distributed methods is presented. In [28] the convergence of several first-order methods was generalized to the case of a time-varying network, provided that the network is connected at each iteration, while in [18] the convergence analysis of [21] is extended to the time-varying and directed case, assuming joint connectivity of the sequence of networks and weight balance of each graph. Interestingly, the exact first-order methods are also related to augmented Lagrangian algorithms, e.g., [20]. For example, the methods in [18,21] and [13] have been shown to be equivalent to certain primal-dual methods that optimize an augmented Lagrangian function associated with the original problem. In [24] an accelerated gradient-based method for the time-varying directed case is proposed, with weaker assumptions on the underlying networks. In [16] the authors propose a mirror descent method that assumes time-varying jointly-strongly-connected networks. In [8] the authors consider the problem of minimizing f(y) + G(y) over a closed and convex set K, where f is a possibly non-convex function as in (1) and G is a convex nonseparable term, and they propose a gradient-tracking method that achieves convergence in the case of time-varying directed jointly-connected networks for diminishing synchronized step sizes. In [25] the method proposed in [8] is extended with constant step sizes to a more general framework, while in [27] R-linear convergence is proved for [25] with strongly convex f(y). A unifying framework for these methods is presented in [34] and, for the case of constant and undirected networks, in [1]. In all the above methods the sequence of step sizes is assumed to be fixed and
coordinated among all the agents. In [19], [35], [32], [33], and [36] the case of uncoordinated time-constant step sizes is considered; that is, each node has a different step size, but these step sizes are constant across all iterations. In [14] a modification of [21] is proposed, with step sizes varying both across nodes and iterations, and it is proved that there exist suitable safeguards for the steps, depending on the regularity properties of the objective function and the network, such that R-linear convergence of the generated sequence to the solution of (1) holds. This result is obtained for an undirected and stationary network. In [29] and [30] asynchronous modifications of [25] are proposed.
Of special interest to the current paper is the spectral gradient method (or Barzilai-Borwein method). This method is very popular in centralized optimization due to its efficiency, as reported in numerous studies, for example [5,9]. In general, the method avoids the famous zig-zag behaviour of the steepest descent method and converges much faster. The method was first proposed by Barzilai and Borwein [3]. This reference proves the method's convergence for two-dimensional convex quadratic problems. The analysis is then extended to arbitrary dimensions and convex quadratic functions by Raydan [22]. Minimization of generic functions is considered in [23] in combination with a nonmonotone line search. R-linear convergence for convex quadratic functions has been proved in [6]. In summary, despite its excellent numerical performance, the spectral gradient method is proved to converge without any safeguarding lower and upper bounds on the step size only for strongly convex quadratic costs. Convergence for generic functions beyond convex quadratics is proved only under step-size safeguarding, coupled with a line-search strategy. Distributed variants of spectral gradient methods over fixed network topologies are studied in [14].
We now summarize this paper's contributions. We establish R-linear convergence of a class of exact distributed first-order methods under the general setting of time-varying directed weight-balanced networks, without the requirement of network connectedness at each iteration, and in the presence of time-varying uncoordinated step sizes. While there have been several existing studies of exact distributed methods under general settings, our study implies several new contributions to the literature; these contributions cannot be derived from existing works and are novelties of this paper.
• We prove that the methods proposed in [13], referred to here (and also in [28]) as the unified EXTRA and the unified DIGing, are robust to time-varying directed networks and time-varying uncoordinated step sizes, i.e., they converge R-linearly in this setting. Up to now, it was only known that these methods converge under static undirected networks [13] or time-varying networks where the network is connected at each iteration [28]. These methods have been previously considered only for time-invariant coordinated step sizes.
• We prove that the method proposed in [14] is robust to time-varying directed networks. Before the current paper, the method was only known to converge for static, undirected networks.
• It is shown in [28] that the EXTRA method [26] may diverge over time-varying networks, even when the network is connected at every iteration. On the other hand, as we show here, the unified EXTRA, a variant of EXTRA proposed in [13], is robust to time-varying networks. Hence, our results reveal that the unified EXTRA can be considered as a means to modify EXTRA and make it robust.
• We provide a thorough numerical study and an analytical study for a special problem structure that demonstrate that the unification strategy in [13] and the spectral gradient-like step-size selection strategy in [14] exhibit a high degree of robustness to time-varying networks and uncoordinated time-varying step sizes. More precisely, we show that these strategies converge, when working on time-varying networks, for wider step-size ranges than commonly used strategies such as constant coordinated step sizes and DIGing algorithmic forms. In addition, we show by simulation that a combination of the unification and the spectral step-size strategies further improves robustness.
Technically, while considering weight-balanced digraphs instead of undirected graphs does not lead to a significant difference in the analysis, the major technical differences here with respect to prior work correspond to the analysis of the unification strategy [13] under time-varying networks and time-and-node-varying step sizes, and of the spectral strategies [14] under time-varying networks.
This paper is organized as follows. In Section 2 we describe the computational framework that we consider and we present the methods that we analyse. In Section 3 we recall a few preliminary results from the literature and we prove a convergence theorem for the methods introduced in Section 2. In Section 4, we show analytically and by simulation that the unification and spectral step-size selection strategies increase the robustness of the methods to time-varying networks and uncoordinated step sizes. Finally, in Section 5, we conclude the paper and outline some future research directions.

The Model and the Class of Considered Methods
We make the following regularity assumptions on the local cost functions f_i.

Assumption A1.
• Each function f_i : R^d → R, i = 1, . . ., n, is twice continuously differentiable;
• There exist 0 ≤ µ_i ≤ L_i such that, for every i = 1, . . ., n and every y ∈ R^d,

µ_i I ⪯ ∇²f_i(y) ⪯ L_i I,

where we write A ⪯ B if the matrix B − A is positive semi-definite. That is, we assume that each of the local functions is µ_i-strongly convex and has Lipschitz continuous gradient with constant L_i. Denoting L = Σ_{i=1}^n L_i and µ = Σ_{i=1}^n µ_i, we have that the aggregate function f is µ-strongly convex and ∇f is Lipschitz continuous with constant L.
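As a quick numerical illustration (our own sketch, not part of the paper), the following snippet builds local quadratic Hessians with spectra inside [µ_i, L_i] and checks, via Weyl's inequality, that the Hessian of the aggregate f = Σ f_i has spectrum inside [µ, L] = [Σ µ_i, Σ L_i]:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 4

# Hypothetical local Hessians H_i, symmetric with eigenvalues in [mu_i, L_i].
mus = rng.uniform(0.5, 1.0, size=n)
Ls = rng.uniform(2.0, 4.0, size=n)
hessians = []
for i in range(n):
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal basis
    eigs = rng.uniform(mus[i], Ls[i], size=d)         # spectrum inside [mu_i, L_i]
    hessians.append(Q @ np.diag(eigs) @ Q.T)

H = sum(hessians)                  # Hessian of the aggregate function f
spec = np.linalg.eigvalsh(H)
mu, L = mus.sum(), Ls.sum()
# Aggregate strong-convexity and Lipschitz constants bound the summed spectrum.
assert mu - 1e-9 <= spec.min() and spec.max() <= L + 1e-9
```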
Given x_1, . . ., x_n ∈ R^d we define the stacked vector

x = (x_1^T, . . ., x_n^T)^T ∈ R^{nd}.    (3)

We denote with e the vector of length n with all components equal to 1. For a matrix A ∈ R^{n×n} we denote with ν_max(A) the largest singular value of A. Moreover, given a sequence of matrices {M_k}_k and m ∈ N, let

M_k^m = M_k M_{k-1} · · · M_{k-m+1},  k ≥ m − 1.

It is assumed that at iteration k the n agents are the nodes of a given network G_k = ({1, . . ., n}, E_k), where E_k denotes the set of the edges of the network, and to each G_k we associate a consensus matrix W_k ∈ R^{n×n}. The assumptions on the sequences {G_k} and {W_k}, which are the same hypotheses considered in [18], are stated below.

Assumption A2.
For every k = 0, 1, . . ., G_k = ({1, . . ., n}, E_k) is a directed graph and W_k is an n × n doubly stochastic matrix with w_ij = 0 if i ≠ j and (i, j) ∉ E_k. Moreover, there exists a positive integer m such that sup_{k ≥ m−1} ν_k < 1, where ν_k = ν_max(W_k^m − (1/n)ee^T).

Remark 2.1. Assumption A2 is weaker than requiring each graph G_k to be connected. For example, it can be proved (see [18]) that in the case of undirected networks, if the sequence is jointly connected then we can ensure Assumption A2 by taking W_k as, e.g., the Metropolis matrix [31] associated with G_k. In more detail, the following can be shown. Assume that the positive entries of the weight matrices W_k are always bounded from below by a positive constant w (including the diagonal entries, i.e., assume that the diagonal entries of W_k are always greater than or equal to w). Furthermore, assume network connectedness over bounded intercommunication intervals. That is, for any fixed iteration k, consider the graph G_k^m whose set of links is the union of the sets of links of the graphs at time instances k − m + 1, . . ., k, and assume that G_k^m is strongly connected, for every k. It is then easy to show that these assumptions imply that ν_max(W_k^m − (1/n)ee^T) < 1.

We also comment on the role of the quantity m in the convergence of (5). Our main result, Theorem 2 ahead, certifies that there exist step-size lower and upper bounds d_min and d_max such that R-linear convergence of the method (5) holds. The result holds for any choice of m. Clearly, the specific values of d_min and d_max in general depend on m. Intuitively, we can expect that for larger m the maximal admissible step size is lower; also, for fixed step-size choices d_min and d_max that lead to R-linear convergence, larger m leads to slower R-linear convergence, i.e., to a worse R-linear convergence factor.
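To illustrate Remark 2.1 with a toy example of our own (not from the paper): neither of the two graphs below is connected, but their union over a window of length m = 2 is, and the product of the corresponding Metropolis matrices satisfies the contraction required by Assumption A2.

```python
import numpy as np

def metropolis(n, edges):
    """Metropolis weight matrix [31] for an undirected graph on n nodes."""
    deg = np.zeros(n, dtype=int)
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    W = np.zeros((n, n))
    for i, j in edges:
        W[i, j] = W[j, i] = 1.0 / (1 + max(deg[i], deg[j]))
    np.fill_diagonal(W, 1.0 - W.sum(axis=1))
    return W

n, m = 4, 2
# Two disconnected graphs whose union over the window of length m = 2 is a cycle.
graphs = [[(0, 1), (2, 3)], [(1, 2), (0, 3)]]
Ws = [metropolis(n, E) for E in graphs]
assert all(np.allclose(W.sum(axis=1), 1.0) for W in Ws)  # doubly stochastic

# W_k^m = W_k W_{k-1} ... W_{k-m+1}: here the product over the whole window.
P = Ws[1] @ Ws[0]
J = np.ones((n, n)) / n
nu = np.linalg.norm(P - J, 2)  # largest singular value of W_k^m - (1/n)ee^T
assert nu < 1.0  # Assumption A2 holds even though no single G_k is connected
```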
We consider the following class of methods. Assume that at each iteration node i holds two vectors x_i^k and u_i^k in R^d and that the global vectors x^k, u^k ∈ R^{nd}, defined as in (3), are updated according to the rules (5), where D_k = diag(d_1^k, . . ., d_n^k), with d_i^k being the step size of node i at iteration k, and where B_k is a symmetric n × n matrix that respects the sparsity structure of the communication network G_k and is such that for every y ∈ R^d we have B_k(1 ⊗ y) = c(1 ⊗ y) for some c ∈ R. Moreover, we assume that x^0 ∈ R^{nd} is an arbitrary vector and u^0 = 0 ∈ R^{nd}. For B_k = 0 and an appropriate choice of the step sizes d_i^k we get the method introduced in [14]. For D_k = αI, if B_k = bI or B_k = bW we retrieve the class of methods analyzed in [13], while if B_k = 0 we retrieve the DIGing method proposed in [18,21]. For D_k = αI and B_k = bW with b = 1/α we have the EXTRA method [26]; but while this method can be described with this choice of the parameters in equation (5), it is not included in the class of methods we consider. Namely, the theoretical analysis that we carry out in Section 3 requires the parameter b to be independent of the step sizes, thus ruling out the choice b = 1/α that yields the EXTRA method. This is in line with [28], which shows that EXTRA may not converge in general for time-varying networks.

To see that Assumption A2 holds in the setting of Remark 2.1, first note that the matrix W_k^m is clearly doubly stochastic. Furthermore, it is easy to show (e.g., by induction) that, for any (i, j) ∈ E_k^m and for any diagonal position (i, i), i = 1, . . ., n, we have [W_k^m]_ij ≥ w^m. This means that W_k^m is a doubly stochastic matrix with positive diagonal entries and, moreover, all its off-diagonal entries at the positions that correspond to links of G_k^m are strictly positive. Using standard arguments on doubly stochastic matrices, this implies that ν_max(W_k^m − (1/n)ee^T) < 1.
In our analysis, we consider the cases B_k = bI and B_k = bW_k with b a nonnegative constant, and d_min ≤ d_j^k ≤ d_max for every k and every j = 1, . . ., n, for appropriately chosen safeguards 0 < d_min < d_max.
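For concreteness, here is a minimal runnable sketch of the B_k = 0 member of the class, i.e., a DIGing-type gradient-tracking iteration [18,21], over a time-varying jointly connected network with uncoordinated step sizes. The quadratic local costs, the two alternating graphs, and the step-size values are our own toy choices, and the recursion shown is standard DIGing rather than the full recursion (5):

```python
import numpy as np

# Local costs f_i(x) = 0.5 * (x - a_i)^2, so the global minimizer is mean(a).
a = np.array([1.0, 2.0, 3.0, 4.0])
n = a.size
grad = lambda x: x - a  # stacked local gradients

# Two Metropolis-type matrices: neither graph is connected, their union is a cycle.
W_even = np.array([[.5, .5, 0, 0], [.5, .5, 0, 0], [0, 0, .5, .5], [0, 0, .5, .5]])
W_odd = np.array([[.5, 0, 0, .5], [0, .5, .5, 0], [0, .5, .5, 0], [.5, 0, 0, .5]])

rng = np.random.default_rng(1)
d = rng.uniform(0.05, 0.1, size=n)  # uncoordinated (node-varying) step sizes

x = rng.standard_normal(n)
y = grad(x)  # gradient-tracking variable, y^0 = grad F(x^0)
for k in range(600):
    W = W_even if k % 2 == 0 else W_odd
    x_new = W @ x - d * y              # mix, then step with each node's own step size
    y = W @ y + grad(x_new) - grad(x)  # track the network-average gradient
    x = x_new

assert np.max(np.abs(x - a.mean())) < 1e-6  # exact convergence to the minimizer
```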
A possible choice for uncoordinated and time-varying step sizes was proposed in [14], where d_i^k = 1/σ_i^k with σ_i^k given by a safeguarded spectral quotient as in (18). Here, P_U denotes the projection onto the closed set U = [σ_min, σ_max], σ_min = 1/d_max, and σ_max = 1/d_min. We refer to [14] for details on the derivation and intuition behind this step-size choice.
For static networks, this step-size choice incurs no communication overhead per iteration; see [14]. However, for time-varying networks, the communication and storage protocol used to implement this step size needs to be adapted. One way to ensure at node i and iteration k the availability of s_j^{k−1} for (i, j) ∈ E_k is that node i receives s_j^{k−1} from all j such that (i, j) ∈ E_k. That is, each node j per iteration additionally broadcasts one d-dimensional vector s_j^k to all its current neighbors. Therefore, the method described by equation (5) combines [13] and [14] into a more general method.
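The building block behind the step-size rule of [14] is a safeguarded Barzilai-Borwein quotient. The sketch below is our simplification, omitting the neighbor terms s_j^{k−1} discussed above, and shows only the quotient and the projection onto [σ_min, σ_max]:

```python
import numpy as np

def safeguarded_bb_inverse_step(s, g_diff, sigma_min, sigma_max):
    """Barzilai-Borwein-type inverse step size sigma = <s, y>/<s, s>, projected
    onto the safeguarding interval [sigma_min, sigma_max], so that the step
    d = 1/sigma lies in [d_min, d_max] = [1/sigma_max, 1/sigma_min]."""
    ss = float(s @ s)
    sigma = float(s @ g_diff) / ss if ss > 0 else sigma_max
    return min(max(sigma, sigma_min), sigma_max)

# For a quadratic f(x) = 0.5 x^T A x, the gradient difference is y = A s, so the
# unsafeguarded quotient is a Rayleigh quotient of A and lies in [mu, L].
A = np.diag([1.0, 4.0, 10.0])
x0, x1 = np.array([1.0, 1.0, 1.0]), np.array([0.9, 0.7, 0.2])
s = x1 - x0
sigma = safeguarded_bb_inverse_step(s, A @ s, sigma_min=0.5, sigma_max=20.0)
assert 1.0 <= sigma <= 10.0  # inside [lambda_min(A), lambda_max(A)]
```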

Convergence Analysis
We now study the convergence of the method described in (5). Specifically, denoting with y* the solution of (1) and defining x* = (y*^T, . . ., y*^T)^T ∈ R^{nd} and q^k = x^k − x*, we prove that, if Assumptions A1 and A2 hold, there exist 0 < d_min < d_max such that the sequence {x^k} generated by (5) converges to x*.
Given a vector v ∈ R^{nd}, denote with v̄ the average v̄ = (1/n) Σ_{j=1}^n v_j ∈ R^d, and with J the n × n matrix I − (1/n)ee^T, where e^T = (1, . . ., 1) ∈ R^n. Recalling the definition of x^k and u^k given in (5), we define the quantities x̄^k, ū^k, x̃^k = x^k − e x̄^k and ũ^k = u^k − e ū^k, which will be used further on. To simplify the notation, in the rest of the section we assume that the dimension of the problem is d = 1, but the same results can be proved analogously in the general case. A few results listed below will be needed for the convergence result presented in this paper. Since W_k is doubly stochastic, we have that (1/n)ee^T(W_k − I) = 0. Using this equality and the definition of u^{k+1} we get ū^{k+1} = ū^k (9), and by the initialization u^0 = 0 we have that ū^k = 0 for every k (10). Directly by the definition of ũ^k and (9) we get ũ^k = u^k for every k (11). From Assumption A1, for every k there exists a matrix H_k such that ∇F(x^k) − ∇F(x*) = H_k(x^k − x*).

Lemma 1. [18] If the matrix sequence {W_k}_k satisfies Assumption A2, then for every k ≥ m we have ‖(W_k^m − (1/n)ee^T)v‖ ≤ ν‖v‖ for every v ∈ R^n.

Lemma 2. If α ∈ (0, 2/L), then for every y ∈ R^d we have ‖y − α∇f(y) − y*‖ ≤ τ‖y − y*‖, where τ = max{|1 − αµ|, |1 − αL|}.

Following the idea presented in [18], our convergence result relies on the Small Gain Theorem [7], which we now briefly recall. Denote by a := {a^k} an infinite sequence of vectors, a^k ∈ R^d for k = 0, 1, . . .. For a fixed λ ∈ (0, 1) we define

‖a‖^{λ,K} = max_{k=0,...,K} (1/λ^k)‖a^k‖.

Theorem 1 (Small Gain Theorem). If there exist λ ∈ (0, 1) and nonnegative constants γ_1, γ_2, ω_1, ω_2 with γ_1 γ_2 < 1 such that, for all K = 0, 1, . . ., the inequalities ‖a‖^{λ,K} ≤ γ_1 ‖b‖^{λ,K} + ω_1 and ‖b‖^{λ,K} ≤ γ_2 ‖a‖^{λ,K} + ω_2 hold, then sup_K ‖a‖^{λ,K} < ∞ and lim_{k→∞} a^k = 0 R-linearly.
We will use the following technical lemma to show that the sequences q̄^k and x̃^k satisfy the hypotheses of Theorem 1.

Lemma 3. Given b, µ, L ≥ 0, ν ∈ (0, 1) and n, m ∈ N, where we denote with N the set of positive integers, there exist λ ∈ (0, 1) and 0 ≤ d_min < d_max such that the following conditions 1.-7. hold.

Proof. Take λ such that λ^m > ν and d_min < 2n/L, so that 1. and 2. hold. For d_max > d_min and close enough to d_min one can ensure that 3. holds. The left-hand side of condition 4. is an increasing function of ∆ = d_max − d_min and is equal to 0 for ∆ = 0; therefore, taking d_max close enough to d_min, condition 4. holds. Condition 5. holds for d_max < (λ^m − ν)/(λ^m LC). Consider now condition 6.: its left-hand side is an increasing function of d_max, and taking d_max small enough we conclude that the inequality holds. Since we need d_max > d_min, in order to be able to take d_max small we need to take d_min small enough, but this can be done without violating the previous conditions.
We have B_k = bI or B_k = bW_k; in both cases B_k x* = bx*, and therefore (W_k − I)B_k x* = 0. For k ≥ m − 1, using (5), the previous equality and (11), we get (14). By (10) and Lemma 1, and by (11), the definition of B_k and the fact that W_k is doubly stochastic, we can bound the terms appearing in (14). Taking the norm in (14) and using the two previous inequalities, we obtain a bound on ‖ũ^k‖. Notice that this inequality also holds for the third case considered, i.e., for B_k = 0, taking b = 0. Multiplying by 1/λ^{k+1}, taking the maximum for k = −1 : k − 1, and using the fact that, by condition 1. in Lemma 3, ν < λ^m, reordering the terms in the previous inequality and using the decomposition q^k = x̃^k + e q̄^k, we get (15). Let us now consider q̄^k.
Taking norms, by Lipschitz continuity of the gradient, and denoting ∆ = d_max − d_min, we bound ‖q̄^{k+1}‖. Lemma 2 gives a bound for the first term on the right-hand side of the resulting inequality. Multiplying by 1/λ^{k+1} and taking the maximum for k = −1 : k − 1, and since by Lemma 3 we have τ = 1 − µd_min and τ + ∆L < λ, reordering and using (15) we obtain a bound with constants β_1 and β_2 defined in Lemma 3; from 4. in Lemma 3 we then get the desired bound on q̄^k. Finally, let us consider x̃^k. For k ≥ m − 1, by the definition of x̄^k, ũ^k and q^k, and equation (11), taking the norm and applying Lemma 1 and (11), then multiplying by 1/λ^{k+1} and taking the maximum for k = −1 : k − 1, we obtain an inequality with ω_3 = max_{k=−1:m−1} ‖x̃^{k+1}‖ and β_3, β_4, β_5 defined in Lemma 3. In particular, we have β_3 < 1, and we can rearrange the terms of this inequality. Now, applying (15) and 6. from Lemma 3, we obtain the two inequalities required by Theorem 1, with γ_2 γ_3 < 1 by condition 7. in Lemma 3. By the Small Gain Theorem, q̄^k and x̃^k converge to 0, and thus q^k converges to zero, which gives the thesis.
The above theorem states the R-linear convergence of the method, and hence one might naturally ask what the convergence factor is and how it compares with that of similar methods, in particular DIGing. Given that the setting here is rather general, allowing for node-specific and iteration-specific step sizes on time-varying networks, an analytical closed-form expression of the convergence factor is infeasible. One comparison between the convergence factors of DIGing and a method of the class considered here, under the restrictive assumption of a static network, is given in [13], Remark 5. Therein, it is shown that the class of methods considered here can have a better convergence factor than DIGing. Specifically, the favorable convergence factor is achieved in [13] for a method instance within the class when the parameter b is set differently from the choice that recovers DIGing. Although this comparison is derived for a narrow setting, namely fixed step sizes and static networks, contrary to the more general setting of DIGing and the setting considered here, it serves as an indication for the comparison between DIGing and the methods considered here. Furthermore, the numerical experiments presented in Section 4 contain a comparison between the methods considered here and DIGing and show faster convergence and an increased degree of robustness of the methods considered here.

Analytical and Numerical Studies of Robustness of the Methods
Theorem 2 and Lemma 3 ensure convergence of the considered class of methods. Namely, they establish the existence of bounds d_min < d_max such that the methods converge R-linearly under the given assumptions. However, they do not provide any information about the difference ∆ = d_max − d_min, and thus about how much the steps employed by different nodes and at different iterations can differ. In this section we address this issue by investigating in practice the length of the interval of admissible step sizes. First we show a particular example where the method converges without any upper bound d_max; then we present a set of numerical results that show how the step-size bounds influence the convergence and the performance of the methods.
We consider the same framework as in [14] (Section 4.2), with the consensus objective function

f(y) = Σ_{i=1}^n (1/2)(y − a_i)²,    (17)

and we prove that, even if we allow the consensus matrix to change from iteration to iteration, the method converges.

Lemma 4. Assume that the objective function is given by (17), that at iteration k the consensus matrix is given by (4) for every k, and that {x^k} is the sequence generated by (5) with b = 0, e^T x^0 = e^T a and e^T(u^0 + ∇F(x^0)) = 0. If d_i^k = α for every i = 1, . . ., n and for every k, then the method converges R-linearly to the solution of (17) if α_min ≤ α ≤ 2/3 and α_min > 0 is small enough. On the other hand, for any α > 2, there exists a sequence {θ_k}, k = 0, 1, 2, . . ., satisfying the assumptions of the Lemma, such that the method diverges, i.e., ‖x^k‖ → ∞.
Proof. In the case we are considering, (5) is equivalent to a linear recursion in the variables (q^k, z^k). Let us consider the case of a fixed step size d_i^k = α, and denote with ξ^k the vector (q^k, z^k) ∈ R^{2n}. We can see that for every k we have ξ^{k+1} = A_k ξ^k, where the matrix A_k depends on W_k and α. In order to prove the first part of the Lemma, it is enough to show that there exists µ̄ < 1 such that ‖A_k‖²₂ < µ̄ for every iteration index k; that is, we have to prove that the spectral radius of A_k^T A_k is smaller than µ̄ for every k. Denoting with 1, λ_2^k, . . ., λ_n^k the eigenvalues of W_k, it can be proved that the eigenvalues of A_k^T A_k are given by the eigenvalues of 2 × 2 matrices M_i^k, i = 1, . . ., n. By direct computation we can see that the eigenvalues of M_1^k are given by 0 and 2α² − 2α + 1 < 1 − (2/3)α_min, and therefore it is enough to take µ̄ > 1 − (2/3)α_min. Denoting with p_i^k(t) the characteristic polynomial of M_i^k, we can see that, with the values of θ_min, θ_max and α_max given by the assumptions, we can always find 1 − (2/3)α_min < µ̄ < 1 such that p_i^k(µ̄) > 0 and p_i^k(−µ̄) > 0, and thus such that the eigenvalues of M_i^k belong to (−µ̄, µ̄) for every k and every i = 1, . . ., n. To prove that for α > 2 the method is in general not convergent, it is enough to consider the case θ_k = θ_0 for every iteration index k. In this case we have A_k = A_0 for every k, and thus ξ^k = A_0^k ξ^0. We can see [14] that 1 − α is an eigenvalue of A_0, and therefore if α > 2 we have ρ(A_0) > 1, so the sequence {ξ^k} does not converge. This concludes the first part of the proof.
Assume now that the step sizes are computed as in (18). Proceeding as in the proof of Proposition 4.3 in [14], we can prove that σ_i^{k+1} = σ^{k+1} for every i. Using the fact that θ_k > 1/3 and σ_max = 3/2, we can prove that there exists k̄ such that σ^k = σ_max for every k > k̄. Therefore, for k > k̄ the step size becomes the same for all nodes and equal to d_i^k = σ_max^{−1} = 2/3, and thus the method converges by the first part of the Lemma.
The above Lemma certifies convergence of the spectral-like method [14] for time-varying networks and a very specific problem structure, with an all-to-all communication network and consensus quadratic costs. It is worth noting that, for generic quadratic cost functions and sparse time-varying networks, an upper bound on the step size is necessary (see Figures 1 and 2 below). We now draw an analogy between the achieved results for the distributed spectral-like method [14] and the spectral (Barzilai-Borwein) gradient method from centralized optimization. Namely, in centralized settings, the spectral gradient method's convergence without step-size safeguarding has been proved only for strongly convex quadratic cost functions. In the case of generic functions beyond strongly convex quadratics, some safeguards ∆_min and ∆_max are necessary, even in the centralized case; though, in the centralized case, these safeguards can be arbitrarily small (∆_min) and arbitrarily large (∆_max). Therefore, the need for safeguards is to be expected in the distributed optimization scenario as well. This matches the results that we present here. It turns out that the price to be paid in the distributed time-varying-networks scenario is two-fold: 1) the no-safeguards case occurs in a more restricted cost-function setting, namely the consensus quadratic costs (see Lemma 4); and 2) the safeguard step-size bounds in the general case are no longer arbitrary and take a network-dependent form.
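To illustrate the centralized baseline invoked above with a toy instance of our own: on a strongly convex quadratic, the Barzilai-Borwein method converges without any safeguarding.

```python
import numpy as np

# Centralized spectral (Barzilai-Borwein) gradient method on a strongly convex
# quadratic, the one setting where it provably converges without safeguards.
A = np.diag([1.0, 3.0, 10.0])       # Hessian; f(x) = 0.5 x^T A x, minimizer x* = 0
x = np.array([1.0, 1.0, 1.0])
g = A @ x
alpha = 1.0 / 10.0                  # initial step, 1/L
for k in range(200):
    x_new = x - alpha * g
    g_new = A @ x_new
    s, y = x_new - x, g_new - g
    x, g = x_new, g_new
    if np.linalg.norm(g) < 1e-12:
        break
    alpha = float(s @ s) / float(s @ y)  # BB step: inverse Rayleigh quotient of A

assert np.linalg.norm(g) < 1e-8     # converged to the minimizer with no safeguards
```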
We also have the following Lemma, where we continue to assume the consensus problem but relax the requirement that the network is fully connected at all times. When the network is not fully connected, in general we need safeguarding for global convergence. However, as explained below, the following Lemma sheds some light on the behavior of the spectral-like distributed method. While it is not to be considered a global convergence result, it highlights that the next step size has a controlled length provided that the current solution estimate is close to consensus.

Lemma 5. Let us assume that the objective function is given by (17), and that x^0, z^0 are such that e^T x^0 = e^T a and z_i^0 = ∇f(x_i^0) = x_i^0. Moreover, for every i = 1, . . ., n, let the local step size d_i^k be defined as d_i^0 = d_0 > 0 and, for every k ≥ 1, d_i^k = 1/σ_i^k, with σ_i^k as in (20), where s_j^k = x_j^{k+1} − x_j^k. Moreover, let us assume that at each iteration Assumption A2 holds with m = 1.

Given any d̄ > 1, there exists ε̂ > 0 such that, if ‖x^0 − e x̄^0‖ ≤ ε̂, then d_i^1 ≤ d̄ for every i = 1, . . ., n.

Proof. Let us denote with J ∈ R^{n×n} the matrix (1/n)ee^T and with v^k = s^k − e s̄^k. From the assumptions and the double stochasticity of the matrix W_k, we can bound ‖v^0‖ in terms of ε̂, where we define ε̄ = (ν_0 + d_0)ε̂. Moreover, these bounds imply that, for every j = 1, . . ., n, s_j^0 is close to the average s̄^0. It is easy to see that the first inequality, together with the assumption on ε̂, implies σ_i^1 ≥ 1/d̄, which in turn implies the thesis.
Intuitively, the Lemma above says that, for the considered problem, if algorithm (5) with step size (20) starts from a point close to consensus (i.e., a point where the solution estimates across different nodes are mutually close), then the next step size at each node will not be too large. More precisely, the size of the next step is controlled by the consensus neighborhood ε̂ that we start from. In other words, if the next step size is to be upper bounded by an arbitrary constant d̄ > 1, we can find a problem-dependent constant ε̂ such that, starting at most ε̂ away from consensus, the next step size at each node is at most d̄. To further explain this, suppose that all the quantities s_j^k/s_i^k are ε-close to one, i.e., |s_j^k/s_i^k| ∈ (1 − ε, 1 + ε), for all nodes i, j. Then, in view of (21), the quantity σ_i^{k+1}, for all nodes i, is approximately equal to one. In other words, for the special case of the consensus problem, provided that all the quantities s_j^k/s_i^k are ε-close to one, the next step size 1/σ_i^{k+1} is in a neighborhood of one, and is hence bounded.

We now present some numerical results. We consider the problem of minimizing a logistic loss function with l2 regularization; that is, we assume the local objective function f_i at node i is given by

f_i(y) = log(1 + exp(−b_i a_i^T y)) + R‖y‖²,    (22)

where a_i ∈ R^d, b_i ∈ {−1, 1} and R > 0.
We compare three different choices of the matrix B_k in (5) and three different definitions of the step sizes d_i^k, resulting in nine methods. For increasing values of d_max we run each method on the given problem, and we plot in Figure 1 the number of iterations necessary to arrive at convergence. Since the convergence analysis carried out in Section 3 does not rely on any particular definition of the step sizes d_i^k, we need to specify how each node chooses its step size at each iteration. We consider here two cases. The first one, referred to as spectral in Figure 1, is the case where d_i^k = (σ_i^k)^{−1} with σ_i^k as in (18). The second case is the one where each node performs a local line search by employing a backtracking strategy, starting at d_max, to satisfy the classical Armijo condition on the local objective function, that is,

f_i(x_i^k − d ∇f_i(x_i^k)) ≤ f_i(x_i^k) − c d ‖∇f_i(x_i^k)‖²

for a given c ∈ (0, 1). We refer to this method as line search. It is worth noting that there are no convergence guarantees for the line-search method; the rationale for including it in the comparison is to show that the method [14] exhibits a significantly higher degree of robustness with respect to a meaningful, time-varying and node-varying, local step-size strategy. For comparison, we also consider the method with fixed step size d_i^k = d_max for every k and every i = 1, . . ., n. The choices of the matrix B_k are B_k = 0 (plot (a) in Figure 1), B_k = bI and B_k = bW_k, where, for the cases B_k ≠ 0, the value of b is chosen following [13]. Notice that the case d_i^k = d_max and B_k = 0 corresponds to [18,21] with constant, coordinated step sizes. We consider increasing values of d_max in [1/(50L), 10/L], while we fix d_min = 10^{−8} since, in the considered framework, we saw that its choice does not influence the performance of the methods significantly.
In Figure 1 we plot the results in the case where the underlying network is symmetric and time-varying, defined as follows: we consider an undirected and connected network G with n = 25 nodes, generated as a random geometric graph with communication radius n^{−1} ln(n), and we define the sequence of networks {G_k} by deleting each edge of G with probability 1/4 at each iteration. We carried out analogous tests in the cases where G is symmetric and constant and where it is given by a directed ring; the obtained results were comparable to the ones that we present. We also observed in practice that double stochasticity of the consensus matrices appears to be essential for the convergence of the considered methods. We set the dimension d equal to 10 and we generate the quantities involved in the definition of the local objective functions (22) as follows. For i = 1, . . ., n we define a_i = (a_i1, . . ., a_i,d−1, 1)^T, where the components a_ij are independent and drawn from the standard normal distribution, and b_i = sign(a_i^T y* + ε_i), where y* ∈ R^d has independent components drawn from the standard normal distribution and the ε_i are generated according to the normal distribution with mean 0 and standard deviation 0.4. Finally, we take the regularization parameter R = 0.25. The initial vectors x_i^0 are generated independently, with components drawn from the uniform distribution on [0, 1], and at each iteration we define the consensus matrix W_k as the Metropolis matrix [31] associated with G_k.
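The data-generation procedure just described can be sketched as follows; the form of the local cost (22) is our reading (logistic loss plus l2 regularization), and the helper names and the finite-difference check are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, R = 25, 10, 0.25

# a_i = (a_i1, ..., a_i,d-1, 1)^T with standard normal entries; labels with noise.
y_star = rng.standard_normal(d)
A = np.column_stack([rng.standard_normal((n, d - 1)), np.ones(n)])
b = np.sign(A @ y_star + rng.normal(0.0, 0.4, size=n))

def local_f(i, y):
    """f_i(y) = log(1 + exp(-b_i a_i^T y)) + R ||y||^2, our reading of (22)."""
    return np.log1p(np.exp(-b[i] * (A[i] @ y))) + R * (y @ y)

def local_grad(i, y):
    """Gradient of local_f with respect to y."""
    m = b[i] * (A[i] @ y)
    return -b[i] * A[i] / (1.0 + np.exp(m)) + 2.0 * R * y

# Finite-difference check of the analytic gradient at a random point.
y0 = rng.standard_normal(d)
eps = 1e-6
g_num = np.array([(local_f(0, y0 + eps * e) - local_f(0, y0 - eps * e)) / (2 * eps)
                  for e in np.eye(d)])
assert np.allclose(g_num, local_grad(0, y0), atol=1e-5)
```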
We are interested in the number of iterations required by each method to reach a prescribed accuracy. More precisely, we evaluate the iteration number k̄ at which max_{i=1,...,n} ‖x_i^k − y*‖ < ε, where ε = 10^{−5}. In Figure 1, on the x-axis we show the upper bound d_max, while on the y-axis we show k̄ for each method. To facilitate the comparison among the methods, in Figure 2 we plot the same results with the y-axis cut at 2000. We can see from Figure 1 that, for all considered choices of the matrix B_k, the spectral method allows for a maximum step size that is at least 10 times larger than that of the method with fixed step size, while line search allows for a maximum step size equal to 2 and 3 times the maximum step size allowed by the method with fixed step length, for B_k equal to 0 and to bI or bW_k, respectively. Moreover, we can see that choosing B_k = bI seems to increase the maximum value of d_max that yields convergence for all the considered methods. Finally, in Figure 2, we can notice that for most of the tested values of d_max the spectral methods require a smaller number of iterations than the method with fixed step size. That is, in the considered framework, using the uncoordinated time-varying step sizes given by [14] helps to significantly improve both the robustness and the performance of the method. Notice also that the spectral step-size strategy exhibits a stable, practically unchanged, performance for a wide range of d_max; hence, it is not sensitive to the tuning of d_max. This is in contrast with the constant step-size strategy, which is very sensitive to the choice of d_max. It is also worth noting that Theorem 2 requires a conservative upper bound on the step size d_max and a conservative upper bound on the step-size difference ∆, and that both depend on multiple global system parameters (Lemma 3). However, the simulations presented here and other extensive numerical studies suggest that an a priori upper bound on ∆ is not required for convergence. In addition, d_min can be set to a
small value independent of the system parameters, e.g., d_min = 10^{−8}, and setting d_max requires only a coarse upper bound on the quantity 1/L.
holds. By the previous inequality, we have 1 − µd_min + ∆L < 1 and therefore, for fixed d_max and d_min, we can always take λ ∈ (0, 1) such that 3. is satisfied and 1. still holds. Moreover, we can take d_min arbitrarily small and d_max arbitrarily close to d_min without violating conditions 1.-3. Notice that C = λ(1 − λ^m)/(1 − λ) is an increasing function of λ. Let us now consider condition 4., given by

Theorem 2.
Let B_k be defined as B_k = bW_k or B_k = bI for a positive constant b, or B_k = 0. If Assumptions A1 and A2 hold, then there exist 0 < d_min < d_max such that the sequence {x^k} generated by (5) converges R-linearly to x*.

Proof. Define ν = sup_{k ≥ m−1} ν_k < 1, where ν_k and m are given in Assumption A2,