1 Introduction

We consider a set of n computational agents and the following unconstrained optimization problem

$$\begin{aligned} \min _{{\mathbb {R}}^d} f(y) \qquad f(y) = \sum _{j=1}^n f_j(y) \end{aligned}$$
(1)

where each \(f_j\) is a real-valued function on \({\mathbb {R}}^{d}\) held privately by one of the agents, and the agents can communicate over a given network. Problems of this form arise in many practical applications such as sensor networks [15], distributed control [17], distributed learning [4] and many others. Several distributed methods [10,11,12,13, 18, 21, 26, 28, 35] have been proposed in the literature for the solution of (1) that achieve exact convergence to the minimizer with a fixed step-size, when the objective function is convex and Lipschitz-differentiable. In [18, 21, 26] two exact gradient-based methods were proposed, and their convergence was proved for the case where the underlying network is undirected, connected, and remains constant through the entire execution of the algorithm. In [13] a unified analysis of a class of first-order distributed methods is presented. In [28] the convergence of several first-order methods was generalized to the case of a time-varying network, provided that the network is connected at each iteration, while in [18] the convergence analysis of [21] is extended to the time-varying and directed case, assuming joint connectivity of the sequence of networks and weight balance of each graph.Footnote 1 Interestingly, the exact first-order methods are also related to augmented Lagrangian algorithms, e.g., [20]. For example, the methods in [13, 18, 21] have been shown to be equivalent to certain primal-dual methods that optimize an augmented Lagrangian function associated with the original problem. In [24] an accelerated gradient-based method for the time-varying directed case is proposed, under weaker assumptions on the underlying networks. In [16] the authors propose a mirror descent method that assumes time-varying jointly-strongly-connected networks. 
In [8] the authors considered the problem of minimizing \(f(y)+G(y)\) over a closed and convex set \({\mathcal {K}}\), where f is a possibly non-convex function as in (1) and G is a convex non-separable term, and they proposed a gradient-tracking method that achieves convergence over time-varying directed jointly-connected networks with diminishing synchronized step sizes. In [25] the method proposed in [8] is extended, with constant step sizes, to a more general framework, while in [27] R-linear convergence of [25] is proved for strongly convex f(y). A unifying framework for these methods is presented in [34] and, for the case of constant and undirected networks, in [1]. In all the above methods the sequence of step-sizes is assumed to be fixed and coordinated among all the agents. In [19, 32, 33, 35, 36] the case of uncoordinated time-constant step sizes is considered, that is, each node has a different step size but these step sizes are constant across iterations. In [14] a modification of [21] is proposed, with step sizes varying both across nodes and iterations, and it is proved that there exist suitable safeguards for the steps, depending on the regularity properties of the objective function and the network, such that R-linear convergence of the generated sequence to the solution of (1) holds. This result is obtained for undirected, stationary networks. In [29, 30] asynchronous modifications of [25] are proposed.

Of special interest to the current paper is the spectral gradient method (or Barzilai–Borwein method). This method is very popular in centralized optimization due to its efficiency, as reported in numerous studies, for example [5, 9]. In general, the method avoids the famous zig-zag behaviour of steepest descent and converges much faster. The method was first proposed by Barzilai and Borwein [3], who proved its convergence for two-dimensional convex quadratic problems. The analysis was then extended to convex quadratic functions of arbitrary dimension by Raydan [22]. Minimization of generic functions is considered in [23] in combination with a nonmonotone line search. R-linear convergence for convex quadratic functions was proved in [6]. In summary, despite its excellent numerical performance, the spectral gradient method is proved to converge without any safeguarding lower and upper bounds on the step size only for strongly convex quadratic costs. Convergence for generic functions beyond convex quadratics is proved only under step-size safeguarding, coupled with a line search strategy. Distributed variants of spectral gradient methods over fixed network topologies are studied in [14].

We now summarize this paper’s contributions. We establish R-linear convergence of a class of exact distributed first-order methods under the general setting of time-varying directed weight-balanced networks, without requiring network connectedness at each iteration, and in the presence of time-varying uncoordinated step sizes. While there have been several existing studies of exact distributed methods under general settings, our study yields several new contributions that cannot be derived from existing works and are novelties of this paper.

  • We prove that the methods proposed in [13], referred to here (and also in [28]) as the unified Extra and the unified DIGing are robust to time-varying directed networks and time-varying uncoordinated step sizes, i.e., they converge R-linearly in this setting. Up to now, it is only known that these methods converge under static undirected networks [13] or time-varying networks where the network is connected at each iteration [28]. These methods have been previously considered only for time-invariant coordinated step sizes.

  • We prove that the method proposed in [14] is robust to time-varying directed networks. Before the current paper, the method was only known to converge for static, undirected networks.

  • It is shown in [28] that the Extra method [26] may diverge over time-varying networks, even when the network is connected at every iteration. On the other hand, as we show here, the unified Extra, a variant of Extra proposed in [13], is robust to time-varying networks. Hence, our results reveal that the unified Extra can be considered as a means to modify Extra and make it robust.

  • We provide a thorough numerical study, and an analytical study for a special problem structure, demonstrating that the unification strategy in [13] and the spectral gradient-like step-size selection strategy in [14] exhibit a high degree of robustness to time-varying networks and uncoordinated time-varying step-sizes. More precisely, we show that these strategies converge, when working on time-varying networks, for wider step-size ranges than commonly used strategies such as constant coordinated step-sizes and DIGing algorithmic forms. In addition, we show by simulation that combining the unification and the spectral step-size strategies further improves robustness.

Technically, while considering weight-balanced digraphs instead of undirected graphs does not lead to a significant difference in the analysis, the major technical differences with respect to prior work lie in the analysis of the unification strategy [13] under time-varying networks and time- and node-varying step-sizes, and of the spectral strategies [14] under time-varying networks.

This paper is organized as follows. In Sect. 2 we describe the computational framework that we consider and we present the methods that we analyse. In Sect. 3 we recall a few preliminary results from the literature and we prove a convergence theorem for the methods introduced in Sect. 2. In Sect. 4, we show analytically and by simulation that the unification and spectral step-size selection strategies increase robustness of the methods to time-varying networks and uncoordinated step-sizes. Finally, in Sect. 5, we conclude the paper and outline some future research directions.

2 The model and the class of considered methods

We make the following regularity assumptions for the local cost functions \(f_i\).

Assumption A1

  • Each function \(f_{i}: {\mathbb {R}}^d \rightarrow {\mathbb {R}}\), \(i=1,\dots ,n,\) is twice continuously differentiable;

  • There exist \(0\le \mu _i\le L_i\) such that for every \(i=1,\ldots ,n\) and every \(y \in {\mathbb {R}}^d,\)

    $$\begin{aligned} \mu _i I \preceq \nabla ^2 f_i (y) \preceq L_i I \end{aligned}$$
    (2)

where we write \(A\preceq B\) if the matrix \(B-A\) is positive semi-definite. That is, we assume that each of the local functions is \(\mu _i\)-strongly convex and has Lipschitz continuous gradient with constant \(L_i\). Denoting \(L= \sum _{i=1}^n L_i\) and \(\mu = \sum _{i=1}^{n} \mu _i,\) we have that the aggregate function f is \(\mu\)-strongly convex and \(\nabla f\) is Lipschitz continuous with constant L.

Given \(x_1,\dots ,x_n\in {\mathbb {R}}^{d}\) we define

$$\begin{aligned} x:=\left( \begin{matrix} x_1\\ \vdots \\ x_n \end{matrix}\right) \in {\mathbb {R}}^{nd} \qquad F(x) = \sum _{j=1}^n f_j(x_j). \end{aligned}$$
(3)

We denote with e the vector of length n with all components equal to 1. For a matrix \(A\in {\mathbb {R}}^{n\times n}\) we denote with \(\nu _{\max }(A)\) the largest singular value of A. Moreover, given a sequence of matrices \(\{M^k\}_k\) and \(m\in {\mathbb {N}}\), let

$$\begin{aligned} M^k_m:=M^kM^{k-1}\dots M^{k-m+1}, \,\, M^k_0 = I \end{aligned}$$
(4)

It is assumed that at iteration k the n agents are the nodes of a given network \(G^k = (\{1,\dots ,n\}, E_k)\), where \(E_k\) denotes the set of the edges of the network, and to each \(G^k\) we associate a consensus matrix \(W^k\in {\mathbb {R}}^{n\times n}\). The assumptions on the sequences \(\{G^k\}\) and \(\{W^k\},\) which are the same hypotheses considered in [18], are stated below.

Assumption A2

For every \(k =0,1,\ldots\), \(G^k= (\{1,\dots ,n\}, E_k)\) is a directed graph and \(W^k\) is an \(n\times n\) doubly stochastic matrix with \(w^k_{ij}=0\) if \(i\ne j\) and \((i,j)\notin E_k\). Moreover, there exists a positive integer m such that \(\sup _{k=0:m-1}\nu _k<1\), where \(\nu _k = \nu _{\max }(W^k_m-\frac{1}{n}ee^T)\).

Remark 2.1

Assumption A2 is weaker than requiring each graph \(G^k\) to be connected. For example, it can be proved (see [18]) that in the case of undirected networks, if the sequence is jointly-connected then we can ensure Assumption A2 by taking \(W^k\) as, e.g., the Metropolis matrix (Xiao et al. [31]) associated with \(G^k.\) In more detail, the following can be shown. Assume that the positive entries of the weight matrices \(W^k\) are always bounded from below by a positive constant \(\underline{w}\) (including the diagonal entries, i.e., assume that the diagonal entries of \(W^k\) are always greater than or equal to \(\underline{w}\)). Furthermore, assume network connectedness over bounded intercommunication intervals. That is, for any fixed iteration k, consider the graph \(G_{k}^m = (\{1,\dots ,n\}, E_k^m)\), \(E_k^m = \cup _{\ell =k-m+1}^{k}E_\ell\), whose set of links is the union of the sets of links of the graphs at time instances \(\ell =k-m+1,...,k\). Assume that \(G_{k}^m\) is strongly connected for every k. It is then easy to show that these assumptions imply \(\nu _{\max } \left( W^k_m-\frac{1}{n}ee^T\right) <1\).Footnote 2
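The mechanism behind Remark 2.1 is easy to check numerically. The following sketch (our own illustration, not taken from [18] or [31]) builds Metropolis matrices for two single-edge graphs on three nodes: neither graph alone is connected, and indeed neither \(W^k\) alone contracts deviations from the average, but the union of the two edge sets is connected and the two-step product contracts, i.e., \(\nu _{\max }(W^1W^0-\frac{1}{n}ee^T)<1\).

```python
import numpy as np

def metropolis(adj):
    """Metropolis weight matrix of an undirected graph given as a 0/1 adjacency matrix."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if adj[i, j]:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()   # diagonal makes each row sum to 1
    return W

n = 3
avg = np.ones((n, n)) / n
# G^0 has only the edge (0,1); G^1 has only the edge (1,2): jointly connected with m = 2.
A0 = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 0]])
A1 = np.array([[0, 0, 0], [0, 0, 1], [0, 1, 0]])
W0, W1 = metropolis(A0), metropolis(A1)

nu_single = [np.linalg.norm(W - avg, 2) for W in (W0, W1)]  # largest singular values
nu_joint = np.linalg.norm(W1 @ W0 - avg, 2)                 # 2-step product
print(nu_single, nu_joint)
```

Here `np.linalg.norm(A, 2)` returns the largest singular value \(\nu _{\max }(A)\): the individual factors give \(\nu = 1\) (no contraction), while the product gives \(\nu = 1/2\), matching the joint-connectivity discussion above.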

We also comment on the role of the quantity m on the convergence of (5). Our main result, Theorem 2 ahead, certifies that there exists a choice of step-size lower and upper bounds \(d_{min}\) and \(d_{max}\) such that R-linear convergence of the method (5) holds. The result holds for any choice of m. Clearly, the specific values of \(d_{min}\) and \(d_{max}\) in general depend on m. Intuitively, we can expect that for larger m the maximal admissible step-size is lower; also, for fixed step-size choices \(d_{min}\) and \(d_{max}\) that lead to R-linear convergence, larger m leads to slower R-linear convergence, i.e., to a worse R-linear convergence factor.

We consider the following class of methods. Assume that at each iteration node i holds two vectors \(x_i^k\) and \(u_i^k\) in \({\mathbb {R}}^d\) and that the global vectors \(x^k,\ u^k\in {\mathbb {R}}^{nd}\), defined as in (3), are updated according to the following rules:

$$\begin{aligned} {\left\{ \begin{array}{ll} x^{k+1} = {\mathcal {W}}^kx^k-D^k(u^k+\nabla F(x^k))\\ u^{k+1} = u^k + ({\mathcal {W}}^k-I)(\nabla F(x^k)+u^k-B^kx^k) \end{array}\right. } \end{aligned}$$
(5)

where \({\mathcal {W}}^k:=(W^k\otimes I)\in {\mathbb {R}}^{nd\times nd}\), \(D^k = diag(d^k_1I,\dots ,d^k_nI)\) with \(d^k_i\) being the step-size of node i at iteration k, and \(B^k\) is a symmetric \(n\times n\) matrix (identified with \(B^k\otimes I\) when acting on \({\mathbb {R}}^{nd}\)) that respects the sparsity structure of the communication network \(G^k\) and such that for every \(y\in {\mathbb {R}}^d\) we have \(B^k(e\otimes y) = c(e\otimes y)\) for some \(c\in {\mathbb {R}}\). Moreover, we assume that \(x^0\in {\mathbb {R}}^{nd}\) is an arbitrary vector and \(u^0 = 0\in {\mathbb {R}}^{nd}\).
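In code, one iteration of (5) is a few lines. The sketch below (our own illustration, written for the scalar case d = 1 and the choice \(B^k = bI\); all names are ours) makes explicit that \(D^k\) acts as an entrywise scaling by the per-node step sizes:

```python
import numpy as np

def step(x, u, W, d, grad_F, b=0.0):
    """One iteration of update (5) for d = 1, with B^k = b*I.

    x, u   : current iterate and tracker (length-n arrays)
    W      : doubly stochastic consensus matrix of the current iteration
    d      : per-node step sizes (length-n array; may change between iterations)
    grad_F : stacked local gradients, grad_F(x)[i] = grad f_i(x_i)
    """
    n = len(x)
    g = grad_F(x)
    x_new = W @ x - d * (u + g)                    # D^k(...) is an entrywise scaling
    u_new = u + (W - np.eye(n)) @ (g + u - b * x)  # B^k x^k = b x^k
    return x_new, u_new

# Toy run: f_i(y) = 0.5*(y - a_i)^2, so the minimizer of f is the mean of a.
rng = np.random.default_rng(0)
n = 4
a = np.array([1.0, -2.0, 3.0, 0.5])
grad_F = lambda x: x - a
W = 0.5 * np.eye(n) + 0.5 * np.ones((n, n)) / n   # static doubly stochastic matrix
d = np.array([0.25, 0.28, 0.26, 0.27])            # uncoordinated step sizes
x, u = rng.standard_normal(n), np.zeros(n)        # u^0 = 0, arbitrary x^0
for _ in range(600):
    x, u = step(x, u, W, d, grad_F)
print(np.max(np.abs(x - a.mean())))               # distance of every node to the minimizer
```

The toy run uses a static consensus matrix and constant but uncoordinated step sizes only to keep the check simple; in the setting of this paper both \(W^k\) and \(d^k_i\) may change at every iteration.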

For \(B^k = 0\) and an appropriate choice of the step-sizes \(d^k_i\) we get the method introduced in [14]. For \(D^k = \alpha I\), if \(B^k = bI\) or \(B^k = b{\mathcal {W}}\) we retrieve the class of methods analyzed in [13], while if \(B^k = 0\) we retrieve the DIGing method proposed in [18, 21]. For \(D^k = \alpha I\) and \(B^k = b{\mathcal {W}}\) with \(b = \frac{1}{\alpha }\) we have the EXTRA method [26]; but while EXTRA can be described with this choice of the parameters in Eq. (5), it is not included in the class of methods we consider. Namely, the theoretical analysis that we carry out in Sect. 3 requires the parameter b to be independent of the step-sizes, thus ruling out the choice \(b=\frac{1}{\alpha }\) that yields EXTRA. This is in line with [28], which shows that EXTRA may not converge in general for time-varying networks.

In our analysis, we consider the case \(B^k = bI\) and \(B^k = b{\mathcal {W}}^k\) with b non-negative constant and \(d_{min}\le d^k_j\le d_{max}\) for every k and every \(j=1,\dots ,n\) for appropriately chosen safeguards \(0<d_{min}<d_{max}.\)

A possible choice for uncoordinated and time-varying step-sizes was proposed in [14] where we have \(d^k_i = (\sigma _i^k)^{-1}\) with \(\sigma _i^k\) given by

$$\begin{aligned} \sigma ^{k}_i = {{\mathcal {P}}}_{[\sigma _{min}, \sigma _{max}]} \left( \frac{(s_i^{k-1})^{T}y_i^{k-1}}{(s_i^{k-1})^{T}s_i^{k-1}} +\sigma ^{k-1}_i\sum _{j=1}^nw^k_{ij}\left( 1-\frac{(s_i^{k-1})^{T} s_j^{k-1}}{(s_i^{k-1})^{T}s_i^{k-1}}\right) \right) \end{aligned}$$

where \(s^{k-1}_i = x^{k}_i-x^{k-1}_i\) and \(y^{k-1}_i = \nabla f_i(x^k_i) - \nabla f_i(x^{k-1}_i)\). Here, \({{\mathcal {P}}}_U\) denotes the projection onto the closed set U, \(\sigma _{min}=1/d_{max}\), and \(\sigma _{max} =1/d_{min}\). We refer to [14] for details on the derivation and intuition behind this step-size choice. For static networks, this step-size choice incurs no communication overhead per iteration; see [14]. However, for time-varying networks, the communication and storage protocol used to implement this step size needs to be adapted. One way to ensure the availability of \(s_j^{k-1}\) at node i and iteration k for all \((i,j)\in E^k\) is that each node j per iteration additionally broadcasts one d-dimensional vector, \(s_j^k\), to all its current neighbors. Therefore the method described by Eq. (5) combines [13, 14] into a more general method.
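As a concrete illustration, the projected formula above can be computed as follows (a sketch of ours; function and variable names are not from [14], and the check uses the scalar case d = 1 with n = 2 nodes):

```python
import numpy as np

def sigma_update(sigma_prev_i, s_prev, y_prev_i, w_row, i, sigma_min, sigma_max):
    """Spectral-like inverse step size of node i, following the projected formula above.

    s_prev   : list of the vectors s_j^{k-1} (node i only needs those of its current
               neighbors; w_row[j] = 0 for non-neighbors, so those terms vanish)
    y_prev_i : gradient difference y_i^{k-1}
    w_row    : i-th row of the current consensus matrix W^k
    """
    s_i = s_prev[i]
    denom = s_i @ s_i
    bb = (s_i @ y_prev_i) / denom                    # Barzilai-Borwein-type ratio
    corr = sigma_prev_i * sum(
        w_row[j] * (1.0 - (s_i @ s_prev[j]) / denom) for j in range(len(s_prev))
    )
    # projection onto [sigma_min, sigma_max]; the step size is then d_i^k = 1/sigma
    return float(np.clip(bb + corr, sigma_min, sigma_max))

# Scalar example: s_0 = 1, s_1 = 2, y_0 = 1.5, sigma_0^{k-1} = 1, w = (0.5, 0.5):
# bb = 1.5 and corr = 0.5*(1-1) + 0.5*(1-2) = -0.5, giving sigma = 1.0.
sigma = sigma_update(1.0, [np.array([1.0]), np.array([2.0])],
                     np.array([1.5]), [0.5, 0.5], 0, 0.1, 10.0)
print(sigma)
```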

3 Convergence analysis

We now study the convergence of the method described in (5). Specifically, denoting with \(y^*\) the solution of (1) and defining

$$\begin{aligned} x^*:=\left( \begin{matrix} y^*\\ \vdots \\ y^* \end{matrix}\right) \in {\mathbb {R}}^{nd} \end{aligned}$$

we prove that, if Assumptions A1 and A2 hold, there exist \(0<d_{min}<d_{max}\) such that the sequence \(\{x^k\}\) generated by (5) converges to \(x^*\).

Given a vector \(v\in {\mathbb {R}}^{nd},\) denote with \(\bar{v}\) the average \(\bar{v}=\frac{1}{n}\sum _{j=1}^{n}v_j\in {\mathbb {R}}^{d}\) and with J the \(n\times n\) matrix \((I-\frac{1}{n}ee^T)\), where \(e^T =(1,\dots ,1)\in {\mathbb {R}}^{n}\). Recalling the definition of \(x^k\) and \(u^k\) given in (5), we define the following quantities, which will be used further on:

$$\begin{aligned}&\tilde{x}^k = x^k-e\bar{x}^k\in {\mathbb {R}}^{nd},\end{aligned}$$
(6)
$$\begin{aligned}&\tilde{u}^k = u^k+\nabla F(x^*)\in {\mathbb {R}}^{nd},\end{aligned}$$
(7)
$$\begin{aligned}&q^k = x^k-x^* = \tilde{x}^k+e\bar{q}^k\in {\mathbb {R}}^{nd}. \end{aligned}$$
(8)

To simplify the notation, in the rest of the section we assume that the problem dimension is \(d=1\); the same results can be proved analogously in the general case.

A few results listed below will be needed for the convergence result presented in this paper. Since \(W^k\) is doubly stochastic, we have that \(\frac{1}{n}ee^T(W^k-I) = 0\). Using this equality and the definition of \(u^{k+1}\) we get

$$\begin{aligned}&\bar{u}^{k+1} = \frac{1}{n}ee^Tu^{k+1} = \frac{1}{n}ee^Tu^k +\frac{1}{n}ee^T(W^k-I)(u^k+\nabla F(x^k) - B^kx^k) = \bar{u}^k \end{aligned}$$

and by the initialization \(u^0=0\), we have that

$$\begin{aligned} \bar{u}^k = 0. \end{aligned}$$
(9)

Directly by the definition of \(\tilde{u}^k\) and (9) we get

$$\begin{aligned} \frac{1}{n}ee^T\tilde{u}^{k} = \frac{1}{n}ee^T(u^{k} + \nabla F(x^*)) = e\bar{u}^k + \frac{1}{n}e\nabla f(y^*) = 0. \end{aligned}$$
(10)

From Assumption A1, for every k there exists a matrix \(H_k\preceq LI\) such that

$$\begin{aligned} \nabla F(x^k)-\nabla F(x^*)=H_k(x^k-x^*). \end{aligned}$$
(11)

Lemma 1

[18] If the matrix sequence \(\{W^k\}_k\) satisfies Assumption A2, then for every \(k\ge m\) we have

$$\begin{aligned} \left\| JW^k_my\right\| \le \nu _k\left\| Jy\right\| \end{aligned}$$

Lemma 2

[2] If the function f satisfies Assumption A1 and \(0<\alpha <\frac{2}{L}\), then

$$\begin{aligned} \Vert y-\alpha \nabla f(y)-y^*\Vert \le \tau \Vert y-y^*\Vert \end{aligned}$$

where \(\tau = \max \{|1-\alpha \mu |,|1-\alpha L|\}.\)
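Lemma 2 is the standard contraction property of one gradient step on a strongly convex function. A quick numerical check on a separable quadratic (our own illustration, not part of [2]) shows the factor \(\tau\) at work:

```python
import numpy as np

# f(y) = 0.5*mu*y_1^2 + 0.5*L*y_2^2 is mu-strongly convex with L-Lipschitz gradient,
# minimized at y* = 0.
mu, L, alpha = 0.5, 2.0, 0.4
grad = lambda y: np.array([mu * y[0], L * y[1]])
tau = max(abs(1 - alpha * mu), abs(1 - alpha * L))   # contraction factor of Lemma 2

y = np.array([1.0, 1.0])
lhs = np.linalg.norm(y - alpha * grad(y))            # ||y - alpha*grad f(y) - y*||
print(lhs, tau * np.linalg.norm(y))                  # lhs is bounded by tau*||y - y*||
```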

Following the idea presented in [18], our convergence result relies on the Small Gain Theorem [7], which we now briefly recall. Denote by \(\textbf{a}:=\{a^k\}\) an infinite sequence of vectors, \(a^k \in {\mathbb {R}}^d\) for \(k=0,1,\ldots\). For a fixed \(\lambda \in (0,1)\) we define

$$\begin{aligned} \Vert \textbf{a}\Vert ^{\lambda , K}&= \max _{k=0,1,\ldots , K} \left\{ \frac{1}{\lambda ^k}\Vert a^k\Vert \right\} \\ \Vert \textbf{a}\Vert ^{\lambda }&= \sup _{k \ge 0} \left\{ \frac{1}{\lambda ^k}\Vert a^k\Vert \right\} . \end{aligned}$$

Theorem 1

[7] Let \(\textbf{a} = \{a^k\}\) and \(\textbf{b} =\{b^k\}\) be two vector sequences, with \(a^k, b^k \in {\mathbb {R}}^d.\) If there exists \(\lambda \in (0,1)\) such that for all \(K = 0,1,\ldots ,\) the following inequalities hold

$$\begin{aligned}&\Vert \textbf{a}\Vert ^{\lambda ,K} \le \gamma _1 \Vert \textbf{b}\Vert ^{\lambda , K} +w_1,\nonumber \\&\Vert \textbf{b}\Vert ^{\lambda ,K} \le \gamma _2 \Vert \textbf{a}\Vert ^{\lambda , K} + w_2, \end{aligned}$$
(12)

with \(\gamma _1\cdot \gamma _2 \in [0,1)\), then

$$\begin{aligned} \Vert \textbf{a}\Vert ^{\lambda } \le \frac{1}{1-\gamma _1 \gamma _2}(w_1 + \gamma _1 w_2) \end{aligned}$$

and

$$\begin{aligned} \lim _{k \rightarrow \infty } a^k =0\,\,\,\text {R-linearly}. \end{aligned}$$

We will use the following technical Lemma to show that the sequences \(\Vert \bar{q}^k\Vert\) and \(\Vert \tilde{x}^k\Vert\) satisfy the hypotheses of Theorem 1.

Lemma 3

Given \(b, \mu , L \ge 0\), \(\nu \in (0,1)\) and \(n,m\in {\mathbb {N}}\), where we denote with \({\mathbb {N}}\) the set of positive integers, there exist \(\lambda \in (0,1)\) and \(0\le d_{min}<d_{max}\) such that the following conditions hold:

  1. 1.

    \(\nu <\lambda ^m;\)

  2. 2.

    \(\frac{d_{min}}{n}<\frac{2}{L};\)

  3. 3.

    \(1-\mu d_{min}+\Delta L<\lambda ;\)

  4. 4.

    \(\gamma \beta _2<1;\)

  5. 5.

    \(\beta _3<1;\)

  6. 6.

    \(\frac{\beta _5\gamma }{1-\beta _3}<1;\)

  7. 7.

    \(\frac{\beta _1+\gamma \beta _2}{1-\gamma \beta _2} \cdot \frac{\beta _4+\gamma \beta _5}{1-\beta _3-\gamma \beta _5}<1,\)

where

$$\begin{aligned}&\gamma = \frac{(b+L)C}{\lambda ^m-\nu }, \quad \beta _1 = \frac{Ld_{max}}{\lambda -1+\mu d_{min}-\Delta L},\\&\beta _2 = \frac{\Delta }{Ld_{max}}\beta _1, \quad \beta _3 = \frac{\nu }{\lambda ^m}+\beta _4, \\&\beta _4 = L\beta _5, \quad \beta _5 = \frac{Cd_{max}}{\lambda ^m},\\&\Delta = d_{max}-d_{min}, \quad C = \frac{\lambda (1-\lambda ^m)}{1-\lambda }. \end{aligned}$$

Proof

Take \(\lambda ^m>\nu\) and \(d_{min}<\frac{2n}{L}\) so that 1 and 2 hold. For \(d_{max}>d_{min}\) and close enough to \(d_{min}\) one can ensure that

$$\begin{aligned} \frac{d_{max}}{d_{min}}<1+\frac{\mu }{L} \end{aligned}$$
(13)

holds. By the previous inequality, we have \(1-\mu d_{min}+\Delta L<1\) and therefore, for fixed \(d_{max}\) and \(d_{min}\), we can always take \(\lambda \in (0,1)\) such that 3 is satisfied and 1 still holds. Moreover, we can take \(d_{min}\) arbitrarily small and \(d_{max}\) arbitrarily close to \(d_{min}\) without violating conditions 1–3. Notice that \(C = \frac{\lambda (1-\lambda ^m)}{1-\lambda }\) is an increasing function of \(\lambda\).

Let us now consider condition 4 given by

$$\begin{aligned} \frac{(b+L)C\Delta }{(\lambda ^m-\nu )(\lambda -1+\mu d_{min}-\Delta L)}<1. \end{aligned}$$

The left hand side expression is an increasing function of \(\Delta\) and it is equal to 0 for \(\Delta =0.\) Therefore, taking \(d_{max}\) close enough to \(d_{min}\), condition 4 holds.

Condition 5 holds for \(d_{max}<\frac{\lambda ^m-\nu }{LC}\).

Consider now condition 6,

$$\begin{aligned} \frac{(b+L)C^2d_{max}}{(\lambda ^m-\nu ) (\lambda ^m-\nu -LCd_{max})}<1 \end{aligned}$$

The left hand side expression is an increasing function of \(d_{max}\) and taking \(d_{max}\) small enough we conclude that the previous inequality holds. Since we need \(d_{max}>d_{min}\), in order to be able to take \(d_{max}\) small, we need to take \(d_{min}\) small enough, but this can be done without violating the previous conditions.

By definition, \(\frac{\beta _1+\gamma \beta _2}{1-\gamma \beta _2}\) and \(\frac{\beta _4+\gamma \beta _5}{1-\beta _3-\gamma \beta _5}\) are also increasing functions of \(d_{max}\) and \(\Delta\). Thus, we can apply the same reasoning that we applied to conditions 4 and 6 to make both factors smaller than 1. In particular, we can take \(d_{min}\) and \(d_{max}\) such that condition 7 holds. \(\square\)
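Lemma 3 is constructive, and its conditions are straightforward to verify numerically for concrete data. The sketch below (our own check; the chosen constants are arbitrary examples, not values prescribed by the paper) evaluates the quantities defined in the lemma and confirms that conditions 1–7 can hold simultaneously:

```python
# Constants and conditions 1-7 of Lemma 3 for given problem data (b, mu, L, nu, m, n).
def lemma3_conditions(b, mu, L, nu, m, n, lam, dmin, dmax):
    Delta = dmax - dmin
    C = lam * (1 - lam**m) / (1 - lam)
    gamma = (b + L) * C / (lam**m - nu)
    beta1 = L * dmax / (lam - 1 + mu * dmin - Delta * L)
    beta2 = Delta / (L * dmax) * beta1
    beta5 = C * dmax / lam**m
    beta4 = L * beta5
    beta3 = nu / lam**m + beta4
    return [
        nu < lam**m,                                   # condition 1
        dmin / n < 2 / L,                              # condition 2
        1 - mu * dmin + Delta * L < lam,               # condition 3
        gamma * beta2 < 1,                             # condition 4
        beta3 < 1,                                     # condition 5
        beta5 * gamma / (1 - beta3) < 1,               # condition 6
        (beta1 + gamma * beta2) / (1 - gamma * beta2)  # condition 7
        * (beta4 + gamma * beta5) / (1 - beta3 - gamma * beta5) < 1,
    ]

# Example data: b = 0, mu = 0.5, L = 1, nu = 0.5, m = 2, n = 2;
# lambda = 0.999, d_min = 0.01, d_max = 0.0101 satisfies all seven conditions.
print(lemma3_conditions(0.0, 0.5, 1.0, 0.5, 2, 2, 0.999, 0.01, 0.0101))
```

Consistent with the proof, feasibility is obtained with \(d_{max}\) close to \(d_{min}\) and \(\lambda\) close to 1.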

Theorem 2

Let \(B^k\) be defined as \(B^k = b W^k\) or \(B^k = bI\) for a positive constant b,  or \(B^k = 0.\) If Assumptions A1 and A2 hold then there exists \(d_{min}<d_{max}\) such that the sequence \(\{x^k\}\) generated by (5) converges R-linearly to \(x^*.\)

Proof

Define \(\nu = \displaystyle \sup _{k=0:m-1}\nu _k<1\) where \(\nu _k,\ m\) are given in Assumption A2, and take \(\lambda \in (0,1), 0\le d_{min}<d_{max}\) given by Lemma 3. We prove that \(n^{1/2}\bar{q}^k\) and \(\tilde{x}^k\) satisfy inequalities (12), thus ensuring R-linear convergence by Theorem 1.

We have \(B^k=bI\) or \(B^k = bW^k\); in both cases \(B^kx^* = bx^*\), therefore \((W^k-I)B^kx^* = 0\) and thus

$$\begin{aligned} (W^k-I)B^kx^k = (W^k-I)B^k(x^k-x^*) = (W^k-I)B^kq^k \end{aligned}$$

For \(k\ge m-1\), using (5), the previous equality and (11), we get

$$\begin{aligned} \tilde{u}^{k+1}&=u^{k+1}+\nabla F(x^{{*}}) \nonumber \\&= u^k + (W^k-I)(u^k+\nabla F(x^{{k}})-B^kx^k)+\nabla F(x^{{*}})\nonumber \\&=W^k(u^k+\nabla F(x^{{*}}))+(W^k-I)(\nabla F(x^{{k}})-\nabla F(x^{{*}}))\nonumber \\&\quad -(W^k-I)B^kx^k\nonumber \\&=W^k\tilde{u}^{k}+(W^k-I)H_kq^k-(W^k-I)B^kq^k \nonumber \\&=W^k_m\tilde{u}^{k-m+1}+\sum _{t=0}^{m-1}W^k_t(W^{k-t}-I) \left( H_{k-t}-B^{k-t}\right) q^{k-t} \end{aligned}$$
(14)

By (10) and Lemma 1,

$$\begin{aligned} \Vert W^k_m\tilde{u}^{k-m+1}\Vert&= \Vert JW^k_m\tilde{u}^{k-m+1} \Vert \le \nu \Vert J\tilde{u}^{k-m+1}\Vert \\&=\nu \Vert \tilde{u}^{k-m+1}\Vert \end{aligned}$$

and by (11), the definition of \(B^k\) and the fact that \(W^k\) is doubly stochastic, we get

$$\begin{aligned} \Vert W^k_t(W^{k-t}-I)(H_{k-t}-B^{k-t})q^{k-t}\Vert \le (L+b)\Vert q^{k-t}\Vert . \end{aligned}$$

Taking the norm in (14) and using the two previous inequalities, we have that for \(k\ge m-1\)

$$\begin{aligned} \Vert \tilde{u}^{k+1}\Vert \le \nu \Vert \tilde{u}^{k-m+1}\Vert +(b+L)\sum _{t=0}^{m-1}\Vert q^{k-t}\Vert . \end{aligned}$$

Notice that the above inequality also holds for the third case considered, i.e. for \(B^k = 0,\) taking \(b=0\). Multiplying by \(\frac{1}{\lambda ^{k+1}}\), taking the maximum for \(k=-1:\bar{k}-1\), and defining

$$\begin{aligned} \tilde{\omega }_1 = \max _{k=-1:m-1} \left\{ \frac{1}{\lambda ^{k+1}}\Vert \tilde{u}^{k+1}\Vert \right\} \end{aligned}$$

we get

$$\begin{aligned} \Vert \tilde{u}\Vert ^{\lambda ,\bar{k}}&\le \max _{k=-1:m-1} \left\{ \frac{1}{\lambda ^{k+1}}\Vert \tilde{u}^{k+1}\Vert \right\} +\max _{k=m:\bar{k}}\left\{ \frac{1}{\lambda ^{k+1}} \Vert \tilde{u}^{k+1}\Vert \right\} \\&\le \frac{\nu }{\lambda ^m}\max _{k=m:\bar{k}} \left\{ \frac{1}{\lambda ^{k-m+1}}\Vert \tilde{u}^{k-m+1}\Vert \right\} \\&\quad + (b+L)\sum _{t=0}^{m-1}\frac{1}{\lambda ^t} \max _{k=m:\bar{k}}\left\{ \frac{1}{\lambda ^{k-t}} \Vert q^{k-t}\Vert \right\} + \tilde{\omega }_1\\&\le \frac{\nu }{\lambda ^m}\Vert \tilde{u}\Vert ^{\lambda ,\bar{k}} +\frac{(b+L)}{\lambda ^m}\frac{\lambda (1-\lambda ^m)}{(1-\lambda )} \Vert q\Vert ^{\lambda ,\bar{k}} + \tilde{\omega }_1. \end{aligned}$$

Since by condition 1 in Lemma 3 we have \(\nu <\lambda ^m\), reordering the terms in the previous inequality and using the fact that \(q^k = \tilde{x}^k+e\bar{q}^k\), we get

$$\begin{aligned} \Vert \tilde{u}\Vert ^{\lambda ,\bar{k}}&\le \gamma _1\Vert q\Vert ^{\lambda ,\bar{k}} + \omega _1 \nonumber \\&\le \gamma _1\Vert \tilde{x}\Vert ^{\lambda ,\bar{k}}+\gamma _1n^{1/2}\Vert \bar{q}\Vert ^{\lambda ,\bar{k}} + \omega _1 \end{aligned}$$
(15)

with

$$\begin{aligned} \gamma _1 = \frac{(b+L)\lambda (1-\lambda ^m)}{(1-\lambda ) (\lambda ^m-\nu )}, \qquad \omega _1 = \frac{\lambda ^m}{\lambda ^m-\nu } \tilde{\omega }_1. \end{aligned}$$

Let us now consider \(\bar{q}^k\).

$$\begin{aligned} \bar{q}^{k+1}&= \bar{x}^{k+1}-y^* = \frac{1}{n}e^Tx^{k+1}-y^*\\&=\frac{1}{n}e^T\left( W^kx^k-D^k(u^k+\nabla F(x^k))\right) -y^* \\&= \bar{x}^k-y^* -\frac{d_{min}}{n}\nabla f(\bar{x}^k)\\&\quad +\frac{d_{min}}{n}\sum _{j=1}^n(\nabla f_j(\bar{x}^k)-\nabla f_j(x^k_j))\\&\quad -\frac{1}{n}\sum _{j=1}^n(d_j^k-d_{min})(\nabla f_j(x^k_j) -\nabla f_j(y^*)) \\&\quad + \frac{1}{n}\sum _{j=1}^n(d_{min}-d^k_j)\tilde{u}^k_j. \end{aligned}$$

Taking the norm, by Lipschitz continuity of the gradient and denoting with \(\Delta = d_{max}-d_{min}\), we have

$$\begin{aligned} \Vert \bar{q}^{k+1}\Vert&\le \left\| \bar{x}^k-y^* -\frac{d_{min}}{n}\nabla f(\bar{x}^k)\right\| +\frac{Ld_{min}}{n}\Vert x^k-e\bar{x}^k\Vert _1\\&\quad + \frac{L\Delta }{n}\Vert x^k-x^*\Vert _1 + \frac{\Delta }{n}\Vert \tilde{u}^k\Vert _1. \end{aligned}$$

Now \(\frac{d_{min}}{n}<\frac{2}{L}\), thus Lemma 2 gives a bound for the first term in the right hand side of the last inequality, and we get

$$\begin{aligned} n^{1/2}\Vert \bar{q}^{k+1}\Vert&\le n^{1/2}\tau \Vert \bar{x}^k-y^*\Vert + Ld_{min}\Vert \tilde{x}^k\Vert \\&\quad + L\Delta \Vert x^k-x^*\Vert + \Delta \Vert \tilde{u}^k\Vert \\&\le n^{1/2}(\tau +\Delta L)\Vert \bar{q}^k\Vert +Ld_{max}\Vert \tilde{x}^k\Vert +\Delta \Vert \tilde{u}^k\Vert . \end{aligned}$$

Multiplying by \(\frac{1}{\lambda ^{k+1}}\) and taking the maximum for \(k=-1:\bar{k}-1\) we get

$$\begin{aligned} n^{1/2}\Vert \bar{q}\Vert ^{\lambda ,\bar{k}}\le \frac{\tau +\Delta L}{\lambda } n^{1/2}\Vert \bar{q}\Vert ^{\lambda ,\bar{k}}+\frac{Ld_{max}}{\lambda }\Vert \tilde{x}\Vert ^{\lambda ,\bar{k}} +\frac{\Delta }{\lambda }\Vert \tilde{u}\Vert ^{\lambda ,\bar{k}}. \end{aligned}$$

By Lemma 3 we have \(\tau = 1-\mu d_{min}\) and \(\tau +\Delta L<\lambda\), thus reordering and using (15), we get

$$\begin{aligned} n^{1/2}\Vert \bar{q}\Vert ^{\lambda ,\bar{k}}&\le \frac{Ld_{max}}{\lambda -\tau -\Delta L}\Vert \tilde{x}\Vert ^{\lambda ,\bar{k}} +\frac{\Delta }{\lambda -\tau -\Delta L}\Vert \tilde{u}\Vert ^{\lambda ,\bar{k}}\\&\le (\beta _1+\gamma _1\beta _2)\Vert \tilde{x}\Vert ^{\lambda ,\bar{k}} +\gamma _1\beta _2n^{1/2}\Vert \bar{q}\Vert ^{\lambda ,\bar{k}}+\beta _2\omega _1 \end{aligned}$$

where \(\beta _1\) and \(\beta _2\) are defined in Lemma 3. Take

$$\begin{aligned} \gamma _2 = \frac{\beta _1+\gamma _1\beta _2}{1-\gamma _1\beta _2}, \,\, \omega _2 = \frac{\beta _2\omega _1}{1-\gamma _1\beta _2}. \end{aligned}$$

From condition 4 in Lemma 3 we get

$$\begin{aligned} n^{1/2}\Vert \bar{q}\Vert ^{\lambda ,\bar{k}}\le \gamma _2\Vert \tilde{x}\Vert ^{\lambda ,\bar{k}}+\omega _2. \end{aligned}$$
(16)

Finally, let us consider \(\tilde{x}^k\). For \(k\ge m-1\), by definition of \(x^k, \tilde{u}^k\) and \(q^k,\) and Eq. (11) we have

$$\begin{aligned} \tilde{x}^{k+1}&=J(W^kx^k-D^k(u^k+\nabla F(x^k))) \\&=JW^kW^{k-1}x^{k-1}-JW^kD^{k-1}(u^{k-1}+\nabla F(x^{{k-1}}))\\&\quad -J D^{k}(u^{k}+\nabla F(x^{{k}}))\\&=JW^k_mx^{k-m+1}-J\sum _{t=0}^{m-1}W^k_tD^{k-t} (\tilde{u}^{k-t}+H_{k-t}q^{k-t}). \end{aligned}$$

Taking the norm, applying Lemma 1 and (11), we get

$$\begin{aligned} \Vert \tilde{x}^{k+1}\Vert \le \nu \Vert \tilde{x}^{k-m+1}\Vert +d_{max}\sum _{t=0}^{m-1} (\Vert \tilde{u}^{k-t}\Vert +L\Vert q^{k-t}\Vert ). \end{aligned}$$

Multiplying by \(\frac{1}{\lambda ^{k+1}}\) and taking the maximum for \(k=-1:\bar{k}-1\) we get

$$\begin{aligned} \Vert \tilde{x}\Vert ^{\lambda ,\bar{k}}&\le \frac{\nu }{\lambda ^m}\Vert \tilde{x}\Vert ^{\lambda ,\bar{k}} +d_{max}\frac{\lambda (1-\lambda ^m)}{\lambda ^m(1-\lambda )}\Vert \tilde{u}\Vert ^{\lambda ,\bar{k}}\\&\quad +Ld_{max}\frac{\lambda (1-\lambda ^m)}{\lambda ^m(1-\lambda )} \Vert q\Vert ^{\lambda ,\bar{k}}+\tilde{\omega }_3\\&\le \beta _3\Vert \tilde{x}\Vert ^{\lambda ,\bar{k}} +\beta _4n^{1/2}\Vert \bar{q}\Vert ^{\lambda ,\bar{k}}+\beta _5\Vert \tilde{u}\Vert ^{\lambda ,\bar{k}}+\tilde{\omega }_3. \end{aligned}$$

where \(\tilde{\omega }_3 = \max _{k=-1:m-1} \left\{ \frac{1}{\lambda ^{k+1}}\Vert \tilde{x}^{k+1}\Vert \right\}\) and \(\beta _3, \beta _4, \beta _5\) are defined in Lemma 3. In particular, by condition 5 we have \(\beta _3<1\), and we can rearrange the terms of the previous inequality to get

$$\begin{aligned} \Vert \tilde{x}\Vert ^{\lambda ,\bar{k}}\le \frac{\beta _4}{1-\beta _3}n^{1/2} \Vert \bar{q}\Vert ^{\lambda ,\bar{k}}+\frac{\beta _5}{1-\beta _3} \Vert \tilde{u}\Vert ^{\lambda ,\bar{k}}+\frac{\tilde{\omega }_3}{1-\beta _3}. \end{aligned}$$

Now, applying (15) and condition 6 from Lemma 3, we obtain

$$\begin{aligned} \Vert \tilde{x}\Vert ^{\lambda ,\bar{k}}\le \gamma _3 n^{1/2}\Vert \bar{q}\Vert ^{\lambda ,\bar{k}} +\omega _3 \end{aligned}$$

with

$$\begin{aligned} \gamma _3 = \frac{\beta _4+\beta _5\gamma _1}{1-\beta _3-\gamma _1\beta _5}, \,\, \omega _3= \frac{\tilde{\omega }_3+\beta _5\omega _1}{1-\beta _3-\gamma _1\beta _5}. \end{aligned}$$

We thus proved

$$\begin{aligned}&n^{1/2}\Vert \bar{q}\Vert ^{\lambda ,\bar{k}}\le \gamma _2\Vert \tilde{x}\Vert ^{\lambda ,\bar{k}} +\omega _2\\&\Vert \tilde{x}\Vert ^{\lambda ,\bar{k}}\le \gamma _3n^{1/2}\Vert \bar{q}\Vert ^{\lambda ,\bar{k}} +\omega _3 \end{aligned}$$

with \(\gamma _2\gamma _3<1\) by condition 7 in Lemma 3. By the Small Gain Theorem, \(\Vert \bar{q}^k\Vert\) and \(\Vert \tilde{x}^k\Vert\) converge to 0 R-linearly, and thus \(\Vert q^k\Vert\) converges to zero R-linearly, which completes the proof. \(\square\)

The above theorem states the R-linear convergence of the method, and hence one might naturally ask what the convergence factor is and how it compares with that of similar methods, in particular DIGing. The setting here is rather general, allowing for node-specific and iteration-specific step sizes over time-varying networks, and this generality makes a closed-form expression of the convergence factor infeasible. A comparison between the convergence factors of DIGing and a method of the class considered here, under the restrictive assumption of a static network, is given in [13], Remark 5. Therein, it is shown that the class of methods considered here can have a better convergence factor than DIGing. Specifically, the favorable convergence factor is achieved in [13] for a method instance within the class when the parameter b is set differently from the choice that recovers DIGing. Although such a comparison is derived for a narrow class of problems, with fixed step sizes and static networks, contrary to the more general setting considered here, it serves as an indication of how DIGing and the methods considered here compare. Furthermore, the numerical experiments presented in Sect. 4 compare the method considered here with DIGing and show faster convergence and an increased degree of robustness of the method considered here.

4 Analytical and numerical studies of robustness of the methods

Theorem 2 and Lemma 3 ensure convergence of the considered class of methods. Namely, they establish the existence of bounds \(d_{min}<d_{max}\) such that the methods converge R-linearly under the given assumptions. However, they do not provide any information about the difference \(\Delta = d_{max}-d_{min}\), and thus about how much the steps employed by different nodes and at different iterations can differ. In this section we address this issue by investigating in practice the length of the interval of admissible step-sizes. First we show a particular example where the method converges without any upper bound \(d_{max}\); then we present a set of numerical results that show how the step bounds influence the convergence and the performance of the methods.

We consider the framework of [14] (Sect. 4.2) and prove that the method converges even if we allow the consensus matrix to change from iteration to iteration. Consider the following objective function

$$\begin{aligned} f(y) = \sum _{i=1}^nf_i(y)\text { with } f_i(y) =\frac{1}{2}(y-a_i)^2\text { and } a\in {\mathbb {R}}^{n}\end{aligned}$$
(17)

and assume that at iteration k the consensus matrix is given by

$$\begin{aligned} W_k = (1-\theta _k)I+\theta _k J, \text { with } \theta _k\in (0,1). \end{aligned}$$

Lemma 4

Assume that \(\theta _k\in (\frac{1}{3}, \frac{3}{4})\) for every k, and that \(\{x^k\}\) is the sequence generated by (5) with \(b=0\), \(e^Tx_0 = e^Ta\) and \(e^T(u_0+\nabla F(x_0)) = 0\).

If \(d^k_i = \alpha\) for every \(i=1,\dots ,n\) and for every k, then the method converges R-linearly to the solution of (17) if \(\alpha _{min}\le \alpha \le \frac{2}{3}\) and \(\alpha _{min}>0\) small enough. On the other hand, for any \(\alpha >2\), there exists a sequence \(\{\theta _k\},\ k=0,1,2,\dots\) that satisfies the assumptions of the Lemma such that the method diverges, i.e., \(\Vert x^k\Vert \rightarrow \infty\).

If \(d_i^k = (\sigma ^k_i)^{-1}\) with

$$\begin{aligned} \sigma ^{k+1}_i = {{\mathcal {P}}}_{[\sigma _{min}, \sigma _{max}]} \left( 1+\sigma ^k_i\sum _{j=1}^nw^k_{ij} \left( 1-\frac{s^k_j}{s^k_i}\right) \right) , \end{aligned}$$
(18)

with \(s^k_i = x^{k+1}_i-x^k_i\), \(\sigma _{min} = 0\), \(\sigma _{max} = 3/2\) and \(\sigma ^0_i = \sigma \in (\sigma _{min}, \sigma _{max})\) for every \(i=1,\dots ,n\), then \(\{x^k\}\) converges R-linearly to the solution of (17).

Proof

In the case we are considering, (5) is equivalent to

$$\begin{aligned} {\left\{ \begin{array}{ll} x^{k+1} = W^kx^k-D^kz^k\\ z^{k+1} = W^kz^k + x^{k+1}-x^k \end{array}\right. } \end{aligned}$$
(19)

where \(D^k = diag(d^k_1I,\dots ,d^k_nI)\).

Let us consider the case with fixed step-size \(d^k_i = \alpha\) and let us denote with \(\xi ^k\) the vector \((q^k, z^k)\in {\mathbb {R}}^{2n}\). We can see that for every k we have \(\xi ^{k+1} = A_k\xi ^k\) where the matrix \(A_k\) is given by

$$\begin{aligned} A_k = \left( \begin{matrix} W_k-J &{} -\alpha I\\ W_k-I &{} W_k-\alpha I \end{matrix}\right) \in {\mathbb {R}}^{2n\times 2n}. \end{aligned}$$

In order to prove the first part of the Lemma, it is enough to show that there exists \(\mu <1\) such that \(\Vert A_k\Vert ^2_2<\mu\) for every iteration index k. That is, we have to prove that there exists \(\mu <1\) such that the spectral radius of \(A_k^TA_k\) is smaller than \(\mu\) for every k. Denoting by \(1, \lambda ^k_2,\dots , \lambda ^k_n\) the eigenvalues of \(W_k\), it can be proved that the eigenvalues of \(A_k^TA_k\) are given by the eigenvalues of the \(2\times 2\) matrices \(M^k_i\) defined as

$$\begin{aligned}{} & {} M^k_1 = \left( \begin{matrix} \alpha ^2 &{} \alpha (\alpha -1) \\ \alpha (\alpha -1) &{} (\alpha -1)^2 \end{matrix}\right) \\{} & {} M^k_i = \left( \begin{matrix} (\lambda ^k_i)^2+\alpha ^2 &{} (\lambda ^k_i)^2-(1+\alpha )\lambda ^k_i+\alpha ^2\\ (\lambda ^k_i)^2-(1+\alpha )\lambda ^k_i+\alpha ^2&{} 2(\lambda ^k_i)^2-2(1+\alpha )\lambda _i^k + 1+\alpha ^2 \end{matrix}\right) \end{aligned}$$

for \(i=2,\dots ,n\). By direct computation we can see that the eigenvalues of \(M^k_1\) are given by 0 and \(2\alpha ^2-2\alpha +1<1-\frac{2}{3}\alpha _{min}\), and therefore it is enough to take \(\mu >1-\frac{2}{3}\alpha _{min}.\) Denoting by \(p^k_i(t)\) the characteristic polynomial of \(M^k_i\), we can see that with the values of \(\theta _{min}, \theta _{max}\) and \(\alpha _{max}\) given by the assumptions, we can always find \(1-\frac{2}{3}\alpha _{min}<\mu <1\) such that \(p^k_i(\mu )>0\) and \(p^k_i(-\mu )>0\), and thus such that the eigenvalues of \(M^k_i\) belong to \((-\mu ,\mu )\) for every k and for every \(i=1,\dots ,n.\) To prove that for \(\alpha >2\) the method is in general not convergent, it is enough to consider the case where \(\theta _k = \theta _0\) for every iteration index k. In this case we have \(A_k = A_0\) for every k and thus \(\xi ^{k}= A_0^k\xi ^0\). We can then see [14] that \(1-\alpha\) is an eigenvalue of \(A_0\), and therefore if \(\alpha >2\) we have \(\rho (A_0)>1\), so the sequence \(\{\xi ^k\}\) does not converge. This concludes the first part of the proof.
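As a numerical sanity check of the argument above, one can build \(A_k\) explicitly and verify \(\Vert A_k\Vert _2<1\) on a grid of admissible parameters. The following sketch (an illustration of ours, not part of the formal argument) does this for \(\theta _k\in (1/3,3/4)\) and \(\alpha \in (0,2/3]\):

```python
import numpy as np

def A_matrix(theta, alpha, n):
    """Build the iteration matrix A_k from the proof, with
    W_k = (1 - theta) I + theta J and J = (1/n) e e^T."""
    I = np.eye(n)
    J = np.full((n, n), 1.0 / n)
    W = (1 - theta) * I + theta * J
    return np.block([[W - J, -alpha * I],
                     [W - I, W - alpha * I]])

# Spot-check ||A_k||_2 < 1 for theta in (1/3, 3/4), alpha in (0, 2/3].
ok = all(np.linalg.norm(A_matrix(th, al, 5), 2) < 1.0
         for th in (0.35, 0.5, 0.7) for al in (0.1, 0.4, 2 / 3))
```

Outside the admissible range the bound can fail; e.g., for \(\alpha >2\) the eigenvalue \(1-\alpha\) of \(A_0\) already forces \(\rho (A_0)>1\).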

Assume now that the step-sizes are computed as in (18). Proceeding as in the proof of Proposition 4.3 in [14] we can prove that \(\sigma ^{k+1}_i = \sigma ^{k+1}\) for every i, with \(\sigma ^{k+1}\) given by

$$\begin{aligned} \sigma ^{k+1} = {\left\{ \begin{array}{ll} \min \{\sigma _{max}, 1+\sigma _{max}\theta _k\} &{} \text {if }\ \sigma ^k = \sigma _{max}\\ \min \{\sigma _{max}, \hat{\sigma }^{k+1}\} &{} \text {otherwise} \end{array}\right. } \end{aligned}$$

where \(\hat{\sigma }^{k+1} =1+\theta _k+\theta _k\theta _{k-1}+\dots +\prod _{j=1}^k\theta _j+\sigma ^0\prod _{j=0}^k\theta _j\). By using the fact that \(\theta _k > 1/3\) and \(\sigma _{max} = 3/2\) we can prove that there exists \(\bar{k}\) such that \(\sigma ^k = \sigma _{max}\) for every \(k>\bar{k}.\) Therefore, for \(k>\bar{k}\) the step-size becomes the same for all nodes and equal to \(d^k_i = \sigma _{max}^{-1} =2/3\), and thus the method converges by the first part of the Lemma. \(\square\)
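The first claim of the Lemma is easy to illustrate numerically. The following Python sketch (our own illustration; names are not taken from any reference implementation) iterates scheme (19) with a fixed step \(\alpha\) and time-varying mixing parameters \(\theta _k\):

```python
import numpy as np

def run_consensus_method(a, alpha, theta_seq):
    """Iterate scheme (19) on the consensus problem (17):
    x^{k+1} = W_k x^k - alpha z^k,  z^{k+1} = W_k z^k + x^{k+1} - x^k,
    with W_k = (1 - theta_k) I + theta_k J."""
    n = len(a)
    I = np.eye(n)
    J = np.full((n, n), 1.0 / n)
    x = a.astype(float).copy()   # x^0 = a, so e^T x^0 = e^T a holds
    z = x - a                    # z^0 = grad F(x^0), zero at x^0 = a
    for theta in theta_seq:
        W = (1 - theta) * I + theta * J
        x_new = W @ x - alpha * z
        z = W @ z + x_new - x
        x = x_new
    return x
```

Starting from \(x^0 = a\), the iterates approach the minimizer \(\frac{1}{n}e^Ta\) of (17) for \(\alpha \le 2/3\) and \(\theta _k\in (1/3,3/4)\), in agreement with the Lemma.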

The above Lemma certifies convergence of the spectral-like method [14] for time-varying networks and a very specific problem structure, with an all-to-all communication network and consensus quadratic costs. It is worth noting that, for generic quadratic cost functions and sparse time-varying networks, an upper bound on the step-size is necessary (see Figs. 1 and 2 below). We now draw an analogy between the achieved results for the distributed spectral-like method [14] and the spectral (Barzilai-Borwein) gradient method from centralized optimization. Namely, in centralized settings, convergence of the spectral gradient method without step-size safeguarding has been proved only for strictly convex quadratic cost functions. For generic functions beyond strictly convex quadratics, some safeguards \(\Delta _{min}\) and \(\Delta _{max}\) are necessary even in the centralized case, though there they can be arbitrarily small (\(\Delta _{min}\)) and arbitrarily large (\(\Delta _{max}\)). Therefore, the need for safeguards is to be expected in the distributed optimization scenario as well. This matches the results that we present here. It turns out that the price to be paid in the distributed time-varying networks scenario is two-fold: (1) the no-safeguards case arises in a more restricted cost-function setting, namely the consensus quadratic costs (see Lemma 4); and (2) the safeguard step-size bounds in the general case are no longer arbitrary and take a network-dependent form.
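For concreteness, the safeguarded spectral update (18) can be sketched as follows (a minimal illustration under our own naming; `W[i]` holds the consensus weights of node i):

```python
import numpy as np

def spectral_sigma_update(sigma, s, W, sigma_min=0.0, sigma_max=1.5):
    """One safeguarded update per Eq. (18): sigma_i^{k+1} is the
    projection onto [sigma_min, sigma_max] of
    1 + sigma_i * sum_j w_ij * (1 - s_j / s_i), with s = x^{k+1} - x^k."""
    n = len(sigma)
    new = np.array([1.0 + sigma[i] * (W[i] @ (1.0 - s / s[i]))
                    for i in range(n)])
    return np.clip(new, sigma_min, sigma_max)
```

Note that if the projection returns zero (possible only when \(\sigma _{min}=0\)), the step \(d_i = 1/\sigma _i\) is undefined; the argument in the proof of Lemma 4 shows this situation does not arise for the consensus problem considered there.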

Fig. 1: Number of iterations performed by the methods for different values of the maximum step-size

Fig. 2: Results presented in Fig. 1, with y-axis cut at 2000

We also have the following Lemma where we continue to assume the consensus problem but relax the requirement that the network is fully connected at all times. When the network is not fully connected, in general we need safeguarding for global convergence. However, as explained below, the following Lemma sheds some light on the behavior of the spectral-like distributed method. While it is not to be considered as a global convergence result, it highlights that the next step size has a controlled length provided that the current solution estimate is close to consensus.

Lemma 5

Let us assume that the objective function is given by (17), and that \(x^0,z^0\) are such that \(e^Tx^0 =e^Ta\), \(z^0_i = \nabla f(x^0_i) = x^0_i\). Moreover, for every \(i=1,\dots ,n\) let the local stepsize \(d^k_i\) be defined as \(d^0_i =d^0>0\) and, for every \(k\ge 1\), \(d^k_i = 1/\sigma ^k_i\), with

$$\begin{aligned} \sigma _i^{k+1} = 1+\sigma ^k_i\sum _{j=1}^nw^k_{ij} \left( 1-\frac{s^k_j}{s^k_i}\right) \end{aligned}$$
(20)

where \(s^k_j = x^{k+1}_j-x^k_j\). Moreover, let us assume that at each iteration Assumption A2 holds with \(m=1\).

Given any \(\hat{d}>1\), if \(\Vert x^0-e\bar{x}^0\Vert \le \hat{\varepsilon }\) with \(\hat{\varepsilon }\) satisfying

$$\begin{aligned} \hat{\varepsilon }\le \frac{1}{\nu _0+d^0}\frac{(d^0)^2(\hat{d}-1)| \bar{x}^0|}{2\hat{d} + d^0(\hat{d}-1)} \end{aligned}$$

then \(d^1_i\le \hat{d}\) for every \(i=1,\dots ,n\).

Proof

Let us denote by \(J\in {\mathbb {R}}^{n\times n}\) the matrix \(\frac{1}{n}ee^T\) and let \(v^k = s^k-e\bar{s}^k\). From the assumptions and the double stochasticity of the matrix \(W^k\), we have

$$\begin{aligned} \Vert v^0\Vert&= \Vert (I-J)(x^1-x^0)\Vert = \Vert (I-J)(W^0x^0 - x^0-d^0z^0)\Vert \\&= \Vert (I-J)(W^0-I-d^0I)x^0\Vert \le \Vert (W^0-I)x^0\Vert +d^0\Vert (I-J)x^0\Vert \\&\le (\nu _0+d^0)\Vert x^0-e\bar{x}^0\Vert \le (\nu _0+d^0)\hat{\varepsilon } = \varepsilon \end{aligned}$$

where we defined \(\varepsilon = (\nu _0+d^0)\hat{\varepsilon }\). Moreover,

$$\begin{aligned} |\bar{s}^0| = \frac{1}{n}|e^T(W^0x^0-d^0x^0-x^0)| = d^0|\bar{x}^0|. \end{aligned}$$

These imply that, for every \(j=1,\dots ,n\)

$$\begin{aligned} 1-\frac{2\varepsilon }{d^0|\bar{x}^0|-\varepsilon } \le \frac{s^0_j}{s^0_i}\le 1+\frac{2\varepsilon }{d^0| \bar{x}^0|-\varepsilon }. \end{aligned}$$

Replacing these bounds in (20), and defining \(\sigma ^0 =1/d^0\), we get

$$\begin{aligned} 1-\frac{2\varepsilon \sigma ^0}{d^0|\bar{x}^0|-\varepsilon } \le \sigma ^1_i\le 1+\frac{2\varepsilon \sigma ^0}{d^0| \bar{x}^0|-\varepsilon }. \end{aligned}$$
(21)

It is easy to see that the first inequality, together with the assumption on \(\hat{\varepsilon }\), implies \(\sigma ^1_i\ge 1/\hat{d}\), which in turn implies the thesis. \(\square\)

Intuitively, the Lemma above says that, for the considered problem, if algorithm (5) with stepsize (20) starts from a point close to consensus (i.e., a point where solution estimates across different nodes are mutually close), then the next step size at each node will not be too large. More precisely, the size of the next step size is controlled by the consensus neighborhood \(\hat{\varepsilon }\) that we start from. In other words, if the next step size is to be upper bounded by an arbitrary constant \(\hat{d}>1\), we can find a problem-dependent constant \(\hat{\varepsilon }\) such that, starting at most \(\hat{\varepsilon }\) away from consensus, the next step size at each node is at most \(\hat{d}\). To further explain this, suppose that all the quantities \(s_j^k/s_i^k\) are \(\epsilon ^\prime\)-close to one, i.e., \(s_j^k/s_i^k \in (1-\epsilon ^\prime , 1+\epsilon ^\prime )\), for all nodes i, j. Then, in view of (21), the quantity \(\sigma _i^{k+1}\), for every node i, is approximated as:

$$\begin{aligned} 1 \pm \sigma _i^k \,n\, \epsilon ^\prime . \end{aligned}$$

In other words, for the special case of the consensus problem, provided that all the quantities \(s_j^k/s_i^k\)’s are \(\epsilon ^\prime\)-close to one, the next step-size \(1/\sigma _i^{k+1}\) is in a neighborhood of one, and is hence bounded.
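The bound of Lemma 5 is also easy to check numerically on a small instance. The sketch below is our own construction: the all-to-all mixing matrix and the crude bound \(\nu _0 = 2\) on \(\Vert W^0-I\Vert\) are illustrative choices (a larger \(\nu _0\) only shrinks \(\hat{\varepsilon }\), so the hypothesis of the Lemma is still met). It performs one step of (19) with \(z^0 = x^0\), applies (20), and verifies that the resulting step sizes stay below \(\hat{d}\):

```python
import numpy as np

def next_step_sizes(x0, d0, W):
    """One step of (19) on problem (17) with z^0 = x^0, followed by the
    unprojected update (20); returns the step sizes d_i^1 = 1/sigma_i^1."""
    x1 = W @ x0 - d0 * x0
    s = x1 - x0                           # s^0 = x^1 - x^0
    sigma0 = 1.0 / d0
    sigma1 = np.array([1.0 + sigma0 * (W[i] @ (1.0 - s / s[i]))
                       for i in range(len(x0))])
    return 1.0 / sigma1

n = 4
a = np.array([1.0, 2.0, 3.0, 4.0])
W = np.full((n, n), 1.0 / n)              # doubly stochastic mixing matrix
d0, d_hat, nu0 = 0.5, 2.0, 2.0            # nu0: crude bound on ||W^0 - I||
xbar = a.mean()
eps_hat = (d0**2 * (d_hat - 1) * abs(xbar)) / (
    (nu0 + d0) * (2 * d_hat + d0 * (d_hat - 1)))
# start within the eps_hat consensus neighborhood, preserving e^T x^0 = e^T a
pert = np.array([1.0, -1.0, 1.0, -1.0])
x0 = xbar + 0.5 * eps_hat * pert / np.linalg.norm(pert)
d1 = next_step_sizes(x0, d0, W)
```

On this instance every \(d^1_i\) indeed remains below \(\hat{d}\), as the Lemma predicts.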

We now present some numerical results. We consider the problem of minimizing a logistic loss function with \(l_2\) regularization, that is, we assume the local objective function \(f_{i}\) at node i is given by

$$\begin{aligned} f_i(y) = \ln \left( 1+\exp (-b_ia_i^Ty)\right) + \frac{1}{2}R\Vert y\Vert ^2_2 \end{aligned}$$
(22)

where \(a_i\in {\mathbb {R}}^d,\ b_i\in \{-1,1\}\) and \(R>0.\) We compare three different choices of the matrix B in (5) and three different definitions of the step-sizes \(d^k_i\), resulting in nine methods. For increasing values of \(d_{max}\) we run each method on the given problem and plot in Fig. 1 the number of iterations required to reach convergence.

The step-size strategies are defined as follows. The convergence analysis carried out in Sect. 3 does not rely on any particular definition of the step-sizes \(d^k_i\), therefore we need to specify how each node chooses its step-size at each iteration. We consider two cases. The first one, referred to as spectral in Fig. 1, is the case where \(d^k_i = (\sigma ^k_i)^{-1}\) with \(\sigma ^k_i\) as in (18). In the second case, each node performs a local line search, employing a backtracking strategy starting at \(d_{max}\) to satisfy the classical Armijo condition on the local objective function. That is, to satisfy

$$\begin{aligned} f_i\left( \sum _{j=1}^n w^k_{ij}x^k_j-d^k_i z^k_i \right) \le f_i(x^k_i) - cd^k_i\nabla f_i(x^k_i)^Tz^k_i \end{aligned}$$

with \(c = 10^{-3}\) and \(z^k_i = u^k_i+\nabla f_i(x^k_i).\) We refer to this method as line search. It is worth noting that there are no convergence guarantees for the line search method. The rationale for including it in the comparison is to show that the method [14] exhibits a significantly higher degree of robustness with respect to a meaningful time-varying and node-varying local step-size strategy that can be employed. For comparison, we also consider the method with fixed step-size \(d^k_i = d_{max}\) for every k and every \(i=1,\dots ,n.\) The choices of the matrix \(B^k\) are \(B^k=0\) (plot (a) in Fig. 1), \(B^k=d_{max}^{-1}I\) (plot (b)) and \(B^k = d_{max}^{-1}{\mathcal {W}}^k\) (plot (c)), where for \(B^k\ne 0\) the choice is made following [13]. Notice that the case \(d_i^k=d_{max}\) and \(B^k=0\) corresponds to [18, 21] with constant, coordinated step-sizes. We consider increasing values of \(d_{max}\) in \([\frac{1}{50L}, \frac{10}{L}]\), while we fix \(d_{min}=10^{-8}\) since, in the considered framework, we observed that its choice does not influence the performance of the methods significantly.
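The local backtracking just described can be sketched as follows (a hypothetical helper of ours; `f_i` and `grad_f_i` are callables for the local objective and its gradient, and `x_mix` stands for the mixed iterate \(\sum _j w^k_{ij}x^k_j\)):

```python
import numpy as np

def local_armijo_step(f_i, grad_f_i, x_mix, x_i, z_i, d_max,
                      c=1e-3, shrink=0.5, d_min=1e-8):
    """Backtrack from d_max until the local Armijo condition
    f_i(x_mix - d * z_i) <= f_i(x_i) - c * d * grad_f_i(x_i)^T z_i
    holds, or the lower safeguard d_min is reached."""
    d = d_max
    slope = c * (grad_f_i(x_i) @ z_i)
    while d > d_min and f_i(x_mix - d * z_i) > f_i(x_i) - d * slope:
        d *= shrink
    return d
```

For instance, with a quadratic local objective the backtracking accepts \(d_{max}\) when it already satisfies the condition, and otherwise halves the step until it does.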

In Fig. 1 we plot the results in the case where the underlying network is symmetric and time-varying, defined as follows: we consider an undirected, connected network G with \(n=25\) nodes, generated as a random geometric graph with communication radius \(\sqrt{n^{-1}\ln (n)}\), and we define the sequence of networks \(\{G^k\}\) by deleting each edge of G with probability \(\frac{1}{4}.\) We carried out analogous tests in the cases where G is symmetric and constant and where it is a directed ring; the obtained results were comparable to the ones that we present. We also observed in practice that double stochasticity of the consensus matrices appears to be essential for the convergence of the considered methods.

We set the dimension d to 10 and generate the quantities involved in the definition of the local objective functions (22) as follows. For \(i = 1,\dots ,n\) we define \(a_i = (a_{i1},\dots ,a_{i,d-1},1)^T\), where the components \(a_{ij}\) are independent draws from the standard normal distribution, and \(b_i = \mathrm{sign}(a_i^Ty^*+\epsilon _i)\), where \(y^*\in {\mathbb {R}}^d\) has independent components drawn from the standard normal distribution and the \(\epsilon _i\) are drawn from the normal distribution with mean 0 and standard deviation 0.4. Finally, we take the regularization parameter \(R = 0.25.\) The initial vectors \(x^0_i\) are generated independently, with components drawn from the uniform distribution on [0, 1], and at each iteration we define the consensus matrix \(W_k\) as the Metropolis matrix [31].
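The data generation just described can be reproduced with a short script (a sketch under our reading of the text; the seed and the random generator are our own choices):

```python
import numpy as np

def generate_logistic_data(n=25, d=10, noise_std=0.4, seed=0):
    """Synthetic classification data for (22): features a_i with a
    trailing component fixed to 1 (intercept), labels
    b_i = sign(a_i^T y* + eps_i) with Gaussian noise eps_i."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, d))
    A[:, -1] = 1.0                      # last feature acts as intercept
    y_star = rng.standard_normal(d)
    eps = rng.normal(0.0, noise_std, size=n)
    b = np.sign(A @ y_star + eps)
    return A, b, y_star
```

The returned `y_star` can then serve as the reference point when measuring the accuracy \(\max _i\Vert x_i^{k}-y^*\Vert\) used below.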

We are interested in the number of iterations required by each method to reach a prescribed accuracy. More precisely, we evaluate the iteration number \(\bar{k}\) at which \(\max _{i=1,...,n}\Vert x_i^{\bar{k}}-y^*\Vert < \varepsilon\), where \(\varepsilon =10^{-5}\). In Fig. 1, on the x-axis we show the upper bound \(d_{max}\) while on the y-axis we show \(\bar{k}\) for each method. To facilitate the comparison among the methods, in Fig. 2 we plot the same results, with y-axis cut at 2000.

We can see from Fig. 1 that, for all considered choices of the matrix B, the spectral method allows for a maximum step-size at least 10 times larger than that of the method with fixed step-size, while line search allows for a maximum step-size equal to 2 and 3 times the maximum step-size allowed by the fixed-step method, for B equal to 0 and to bI or \(b{{\mathcal {W}}}\), respectively. Moreover, choosing \(B = bI\) seems to increase the maximum value of \(d_{max}\) that yields convergence for all the considered methods. Finally, in Fig. 2, we can notice that for most of the tested values of \(d_{max}\) the spectral method requires a smaller number of iterations than the method with fixed step-size. That is, in the considered framework, using the uncoordinated time-varying step-sizes given by [14] significantly improves both the robustness and the performance of the method. Notice also that the spectral step-size strategy exhibits a “stable”, practically unchanged, performance for a wide range of \(d_{max}\); hence, it is not sensitive to the tuning of \(d_{max}\). This is in contrast with the constant step-size strategy, which is very sensitive to the choice of \(d_{max}\). It is also worth noting that Theorem 2 requires a conservative upper bound on the step-size \(d_{max}\) and a conservative upper bound on the step-size differences \(\Delta\), and that both depend on multiple global system parameters (Lemma 3). However, the simulations presented here and other extensive numerical studies suggest that an a priori upper bound on \(\Delta\) is not required for convergence. In addition, \(d_{min}\) can be set to a small value independent of system parameters, e.g., \(d_{min}=10^{-8}\), and setting \(d_{max}\) requires only a coarse upper bound on the quantity 1/L.

5 Conclusions

We proved that a class of distributed first-order methods, including those proposed in [13, 14], is robust to time-varying and uncoordinated step-sizes and to time-varying weight-balanced digraphs, wherein, unlike, e.g., the recent work [28], connectedness of the network at each iteration is not required. The achieved results provide a solid improvement in the understanding of the robustness of exact distributed first-order methods to time-varying networks and uncoordinated time-varying step-sizes. Most notably, we show that the unification strategy in [13] and the spectral-like step-size selection strategy in [14], as well as their combination, exhibit a high degree of robustness. This paper considers weight-balanced directed networks. Extensions to weight-imbalanced networks require redefining the algorithmic class and the respective analysis, and represent an interesting future research direction.