# FROST—Fast row-stochastic optimization with uncoordinated step-sizes



## Abstract

In this paper, we discuss distributed optimization over directed graphs, where doubly stochastic weights cannot be constructed. Most of the existing algorithms overcome this issue by applying push-sum consensus, which utilizes column-stochastic weights. The formulation of column-stochastic weights requires each agent to know (at least) its out-degree, which may be impractical in, for example, broadcast-based communication protocols. In contrast, we describe FROST (Fast Row-stochastic-Optimization with uncoordinated STep-sizes), an optimization algorithm applicable to directed graphs that does not require the knowledge of out-degrees, the implementation of which is straightforward as each agent locally assigns weights to the incoming information and locally chooses a suitable step-size. We show that FROST converges linearly to the optimal solution for smooth and strongly convex functions given that the largest step-size is positive and sufficiently small.

## Keywords

Distributed optimization · Multiagent systems · Directed graphs · Linear convergence

## Abbreviations

- ADD-OPT (Accelerated Distributed Directed OPTimization)
- DGD (Distributed Gradient Descent)
- EXTRA (EXact firsT-ordeR Algorithm)
- FROST (Fast Row-stochastic-Optimization with uncoordinated STep-sizes)

## 1 Introduction

In this paper, we study distributed optimization, where *n* agents are tasked to solve the following problem:

$$\text{P1}:\quad\min_{\mathbf{x}\in\mathbb{R}^{p}}\ F(\mathbf{x})=\frac{1}{n}\sum_{i=1}^{n}f_{i}(\mathbf{x}),$$

where each \(f_{i}:\mathbb{R}^{p}\rightarrow\mathbb{R}\) is a private cost function known only to agent *i*. The goal of the agents is to find the global minimizer of the aggregate cost, *F*(**x**), via local communication with their neighbors and without revealing their private objective functions. This formulation has recently received great attention due to its extensive applications in, for example, machine learning [1, 2, 3, 4, 5, 6], control [7], cognitive networks [8, 9], and source localization [10, 11].

Early work on this topic includes Distributed Gradient Descent (DGD) [12, 13], which is computationally simple but slow due to a diminishing step-size. The convergence rates are \(\mathcal {O}\left (\frac {\log k}{\sqrt {k}}\right)\) for general convex functions and \(\mathcal {O}\left (\frac {\log k}{k}\right)\) for strongly convex functions, where *k* is the number of iterations. With a constant step-size, DGD converges faster, albeit to an inexact solution [14, 15]. Related work also includes methods based on the Lagrangian dual [16, 17, 18, 19], which achieve faster convergence at the expense of more computation. To achieve both fast convergence and computational simplicity, several fast distributed first-order methods have been proposed. A Nesterov-type approach [20] achieves \(\mathcal {O}\left (\frac {\log k}{k^{2}}\right)\) for smooth convex functions under a bounded-gradient assumption. EXact firsT-ordeR Algorithm (EXTRA) [21] exploits the difference of two consecutive DGD iterates to achieve linear convergence to the optimal solution. Exact diffusion [22, 23] applies an adapt-then-combine structure [24] to EXTRA and generalizes the symmetric doubly stochastic weights required in EXTRA to locally balanced row-stochastic weights over undirected graphs. Of significant relevance to this paper is a distributed gradient tracking technique built on dynamic consensus [25], which enables each agent to asymptotically learn the gradient of the global objective function. This technique was first proposed simultaneously in [26, 27]. Xu et al. and Qu and Li [26, 28] combine it with the DGD structure to achieve improved convergence for smooth and convex problems. Di Lorenzo and Scutari [27, 29], on the other hand, propose the NEXT framework for a more general class of non-convex problems.

All of the aforementioned methods assume that the multi-agent network is undirected. In practice, it may not be possible to achieve undirected communication. It is of interest, thus, to develop optimization algorithms that are fast and are applicable to arbitrary directed graphs. The challenge here lies in the fact that doubly stochastic weights, standard in many distributed optimization algorithms, cannot be constructed over arbitrary directed graphs. In particular, the weight matrices in directed graphs can only be either row-stochastic or column-stochastic, but not both.

We now discuss related work on directed graphs. Early work based on DGD includes subgradient-push [30, 31] and Directed-Distributed Gradient Descent (D-DGD) [32, 33], with a sublinear convergence rate of \(\mathcal {O}\left (\frac {\log k}{\sqrt {k}}\right)\). Some recent work extends these methods to asynchronous networks [34, 35, 36]. To accelerate the convergence, DEXTRA [37] combines push-sum [38] and EXTRA [21] to achieve linear convergence given that the step-size lies in some non-trivial interval. This restriction on the step-size is later relaxed in ADD-OPT/Push-DIGing [39, 40], which linearly converge for a sufficiently small step-size. Of relevance is also [41], where distributed non-convex problems are considered with column-stochastic weights. More recent work [42, 43] proposes the \(\mathcal {AB}\) and \(\mathcal {AB}m\) algorithms, which employ both row- and column-stochastic weights to achieve (accelerated) linear convergence over arbitrary strongly connected graphs. Note that although the construction of doubly stochastic weights is avoided, all of the aforementioned methods require each agent to know its out-degree to formulate doubly or column-stochastic weights. This requirement may be impractical in situations where the agents use a broadcast-based communication protocol. In contrast, Refs. [44, 45] provide algorithms that only use row-stochastic weights. Row-stochastic weight design is simple and is further applicable to broadcast-based methods.

In this paper, we focus on optimization with row-stochastic weights following the recent work in [44, 45]. We propose a fast optimization algorithm, termed FROST (Fast Row-stochastic Optimization with uncoordinated STep-sizes), which is applicable to both directed and undirected graphs with uncoordinated step-sizes among the agents. Distributed optimization (based on gradient tracking) with uncoordinated step-sizes has been previously studied in [26, 46, 47], over undirected graphs with doubly stochastic weights, and in [48], over directed graphs with column-stochastic weights. These works introduce a notion of heterogeneity among the step-sizes, defined respectively as the relative deviation of the step-sizes from their average in [26, 46] and as the ratio of the largest to the smallest step-size in [47, 48]. It is then shown that when the heterogeneity is small enough, i.e., the step-sizes are very close to each other, and when the largest step-size follows a bound that is a function of the heterogeneity, the proposed algorithms linearly converge to the optimal solution. A challenge in this formulation is that choosing a sufficiently small, local step-size does not ensure small heterogeneity, while no step-size can be chosen to be zero. In contrast, a major contribution of this paper is that we establish linear convergence with uncoordinated step-sizes when the upper bound on the step-sizes is independent of any notion of heterogeneity. The implementation of FROST is therefore completely local, since each agent locally chooses a sufficiently small step-size, independent of the other step-sizes, and locally assigns row-stochastic weights to the incoming information. In addition, our analysis shows that all step-sizes except one can be zero for the algorithm to work, which is a novel result in distributed optimization. We show that FROST converges linearly to the optimal solution for smooth and strongly convex functions.

**Notation:** We use lowercase bold letters to denote vectors and uppercase italic letters to denote matrices. The matrix, *I*_{n}, represents the *n*×*n* identity, whereas **1**_{n} (**0**_{n}) is the *n*-dimensional column vector of all 1’s (0’s). We further use **e**_{i} to denote an *n*-dimensional vector of all 0’s except 1 at the *i*th location. For an arbitrary vector, **x**, we denote its *i*th element by [**x**]_{i} and diag{**x**} is a diagonal matrix with **x** on its main diagonal. We denote by *X*⊗*Y* the Kronecker product of two matrices, *X* and *Y*. For a primitive, row-stochastic matrix, \(\underline {A}\), we denote its left and right Perron eigenvectors by **π**_{r} and **1**_{n}, respectively, such that \(\boldsymbol {\pi }_{r}^{\top }\mathbf {1}_{n} = 1\); similarly, for a primitive, column-stochastic matrix, \(\underline {B}\), we denote its left and right Perron eigenvectors by **1**_{n} and **π**_{c}, respectively, such that \(\mathbf {1}_{n}^{\top }\boldsymbol {\pi }_{c} = 1\) [49]. For a matrix, *X*, we denote *ρ*(*X*) as its spectral radius and diag(*X*) as a diagonal matrix consisting of the corresponding diagonal elements of *X*. The notation ∥·∥_{2} denotes the Euclidean norm of vectors and matrices, while ∥·∥_{F} denotes the Frobenius norm of matrices. Depending on the argument, we denote ∥·∥ either as a particular matrix norm, the choice of which will be clear in Lemma 1, or a vector norm that is compatible with this matrix norm, i.e., ∥*X***x**∥≤∥*X*∥∥**x**∥ for all matrices, *X*, and all vectors, **x** [49].

We now describe the rest of the paper. Section 2 states the problem and assumptions. Section 3 reviews related algorithms that use doubly stochastic or column-stochastic weights and shows the intuition behind the analysis of these types of algorithms. In Section 4, we provide the main algorithm, FROST, proposed in this paper. In Section 5, we develop the convergence properties of FROST. Simulation results are provided in Section 6, and Section 7 concludes the paper.

## 2 Problem formulation

Consider *n* agents communicating over a strongly connected network, \(\mathcal {G}=(\mathcal {V},\mathcal {E})\), where \(\mathcal {V}=\{1,\cdots,n\}\) is the set of agents and \(\mathcal {E}\) is the set of edges, \((i,j), i,j\in \mathcal {V}\), such that agent *j* can send information to agent *i*, i.e., *j*→*i*. Define \(\mathcal {N}_{i}^{{\text {in}}}\) as the collection of in-neighbors, i.e., the set of agents that can send information to agent *i*. Similarly, \(\mathcal {N}_{i}^{{ \text {out}}}\) is the set of out-neighbors of agent *i*. Note that both \(\mathcal {N}_{i}^{{ \text {in}}}\) and \(\mathcal {N}_{i}^{{ \text {out}}}\) include agent *i*. The agents are tasked to solve the following problem:

$$\text{P1}:\quad\min_{\mathbf{x}\in\mathbb{R}^{p}}\ F(\mathbf{x})=\frac{1}{n}\sum_{i=1}^{n}f_{i}(\mathbf{x}),$$

where each local objective, \(f_{i}:\mathbb{R}^{p}\rightarrow\mathbb{R}\), is known only to agent *i*. We denote the optimal solution of P1 as **x**^{∗}. We will discuss different distributed algorithms related to this problem under the applicable set of assumptions, described below:

### **Assumption 1**

The graph, \(\mathcal {G}\), is undirected and connected.

### **Assumption 2**

The graph, \(\mathcal {G}\), is directed and strongly connected.

### **Assumption 3**

Each local objective, *f*_{i}, is convex with bounded subgradient.

### **Assumption 4**

Each local objective, *f*_{i}, is smooth and strongly convex, i.e., ∀*i* and \(\forall \mathbf {x}, \mathbf {y}\in \mathbb {R}^{p}\),

- i. there exists a positive constant *l* such that $$\left\|\nabla f_{i}(\mathbf{x})-\nabla f_{i}(\mathbf{y})\right\|_{2}\leq l\|\mathbf{x}-\mathbf{y}\|_{2};$$
- ii. there exists a positive constant *μ* such that $$f_{i}(\mathbf{y})\geq f_{i}(\mathbf{x})+\nabla f_{i}(\mathbf{x})^{\top}(\mathbf{y}-\mathbf{x})+\frac{\mu}{2}\|\mathbf{x}-\mathbf{y}\|_{2}^{2}.$$

Clearly, the Lipschitz continuity and strong convexity constants for the global objective function, \(F=\tfrac {1}{n}{\sum }_{i=1}^{n}f_{i}\), are *l* and *μ*, respectively.

### **Assumption 5**

Each agent in the network has and knows its unique identifier, e.g., 1,⋯,*n*.

If this were not true, the agents may implement a finite-time distributed algorithm to assign such identifiers, e.g., with the help of task allocation algorithms, [50, 51], where the task at each agent is to pick a unique number from the set {1,…,*n*}.

### **Assumption 6**

Each agent knows its out-degree in the network, i.e., the number of its out-neighbors.

We note here that Assumptions 3 and 4 do not hold together; when applicable, the algorithms we discuss use either one of these assumptions but not both. We will discuss FROST, the algorithm proposed in this paper, under Assumptions 2, 4, and 5.

## 3 Related work

In this section, we discuss related distributed first-order methods and provide an intuitive explanation for each one of them.

### 3.1 Algorithms using doubly stochastic weights

In DGD, each agent *i* maintains a local estimate, \(\mathbf {x}_{k}^{i}\), of the optimal solution, **x**^{∗}, and implements the following iteration:

$$\mathbf{x}_{k+1}^{i}=\sum_{j=1}^{n}w_{ij}\mathbf{x}_{k}^{j}-\alpha_{k}\nabla f_{i}\left(\mathbf{x}_{k}^{i}\right),\qquad(1)$$

where *W*={*w*_{ij}} is doubly stochastic and respects the graph topology. The step-size *α*_{k} is diminishing such that \({\sum }_{k=0}^{\infty }\alpha _{k}=\infty \) and \({\sum }_{k=0}^{\infty }\alpha _{k}^{2}<\infty \). Under the Assumptions 1, 3, and 6, DGD converges to **x**^{∗} at the rate of \(\mathcal {O}\left (\frac {\log k}{\sqrt {k}}\right)\). The convergence rate is slow because of the diminishing step-size. If a constant step-size is used in DGD, i.e., *α*_{k}=*α*, it converges faster to an error ball, proportional to *α*, around **x**^{∗} [14, 15]. This is because **x**^{∗} is not a fixed point of the above iteration when the step-size is a constant.
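To make the iteration concrete, the following is a minimal numerical sketch of Eq. (1), assuming scalar quadratic objectives \(f_i(x)=(x-b_i)^2/2\) (so the global minimizer is the mean of the *b*_{i}'s) and a hand-built doubly stochastic weight matrix for a 4-agent cycle; the values of `b`, the weights, and the step-size schedule are illustrative, not from the paper.

```python
import numpy as np

# Doubly stochastic (symmetric) weights for an undirected 4-cycle.
W = np.array([[0.50, 0.25, 0.00, 0.25],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])

# Hypothetical local quadratics f_i(x) = (x - b_i)^2 / 2, grad f_i(x) = x - b_i;
# the minimizer of F = (1/n) sum_i f_i is mean(b).
b = np.array([1.0, 2.0, 3.0, 4.0])
x_star = b.mean()

x = np.zeros(4)                      # arbitrary initial estimates
for k in range(5000):
    alpha_k = 1.0 / (k + 1)          # diminishing step-size
    x = W @ x - alpha_k * (x - b)    # Eq. (1): consensus step plus local descent

dgd_error = np.abs(x - x_star).max()
```

With the diminishing step-size the agents slowly agree on **x**^{∗}; replacing `alpha_k` by a constant leaves a residual error ball proportional to that constant, as discussed above.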

To accelerate convergence, DGD may be combined with gradient tracking^{1}. The algorithm is updated as follows [26, 28]:

$$\mathbf{x}_{k+1}^{i}=\sum_{j=1}^{n}w_{ij}\mathbf{x}_{k}^{j}-\alpha\,\mathbf{y}_{k}^{i},\qquad(2\text{a})$$
$$\mathbf{y}_{k+1}^{i}=\sum_{j=1}^{n}w_{ij}\mathbf{y}_{k}^{j}+\nabla f_{i}\left(\mathbf{x}_{k+1}^{i}\right)-\nabla f_{i}\left(\mathbf{x}_{k}^{i}\right),\qquad(2\text{b})$$

initialized with \(\mathbf {y}_{0}^{i}=\nabla f_{i}\left (\mathbf {x}_{0}^{i}\right)\) and an arbitrary \(\mathbf {x}_{0}^{i}\) at each agent. The first equation is essentially a descent method, after mixing with neighboring information, where the descent direction is \(\mathbf {y}_{k}^{i}\), instead of \(\nabla f_{i}\left (\mathbf {x}_{k}^{i}\right)\) as was in Eq. (1). The second equation is a global gradient estimator when viewed as dynamic consensus [52], i.e., \(\mathbf {y}_{k}^{i}\) asymptotically tracks the average of local gradients: \(\frac {1}{n}{\sum }_{i=1}^{n}\nabla f_{i}\left (\mathbf {x}_{k}^{i}\right)\). It is shown in Refs. [28, 40, 46] that \(\mathbf {x}_{k}^{i}\) converges linearly to **x**^{∗} under Assumptions 1, 4, and 6, with a sufficiently small step-size, *α*. Note that these methods, Eqs. (1) and (2a)–(2b), are not applicable to directed graphs as they require doubly stochastic weights.
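The gradient-tracking iterations in Eqs. (2a)–(2b) can be sketched in the same kind of illustrative setting (scalar quadratic objectives and a doubly stochastic 4-cycle; the constant step-size below is an assumption chosen small enough for this toy problem):

```python
import numpy as np

W = np.array([[0.50, 0.25, 0.00, 0.25],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])
b = np.array([1.0, 2.0, 3.0, 4.0])   # hypothetical f_i(x) = (x - b_i)^2 / 2
x_star = b.mean()

alpha = 0.1                          # constant, sufficiently small step-size
x = np.zeros(4)
y = x - b                            # y_0^i = grad f_i(x_0^i)
for _ in range(500):
    x_new = W @ x - alpha * y              # Eq. (2a): descend along tracked gradient
    y = W @ y + (x_new - b) - (x - b)      # Eq. (2b): dynamic consensus on gradients
    x = x_new

gt_error = np.abs(x - x_star).max()
```

Unlike DGD, the iterates converge linearly with a constant step-size, because \(\mathbf{y}_k^i\) tracks the average gradient, which vanishes at the optimizer.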

### 3.2 Algorithms using column-stochastic weights

To motivate the algorithms over directed graphs, first consider DGD when the weight matrix is column-stochastic but not necessarily row-stochastic. Summing the updates in Eq. (1) over the agents shows that the average of the estimates evolves as

$$\overline{\mathbf{x}}_{k+1}=\overline{\mathbf{x}}_{k}-\frac{\alpha_{k}}{n}\sum_{i=1}^{n}\nabla f_{i}\left(\mathbf{x}_{k}^{i}\right),\qquad(3)$$

where \(\overline {\mathbf {x}}_{k}=\frac {1}{n}{\sum }_{i=1}^{n}\mathbf {x}_{k}^{i}\). From Eq. (3), it is clear that the average of the estimates, \(\overline {\mathbf {x}}_{k}\), converges to **x**^{∗}, as Eq. (3) can be viewed as a centralized gradient method if each local estimate \(\mathbf {x}_{k}^{i}\) converges to \(\overline {\mathbf {x}}_{k}\). However, since the weight matrix is *not row-stochastic*, the estimates of the agents will not reach an agreement [32]. This discussion motivates combining DGD with an algorithm, called push-sum, briefly discussed next, that enables agreement over directed graphs with column-stochastic weights.

#### 3.2.1 Push-sum consensus

At each iteration *k*, each agent maintains two state vectors, \(\mathbf {x}_{k}^{i}\), \(\mathbf {z}_{k}^{i}\in \mathbb {R}^{p}\), and an auxiliary scalar variable, \(v_{k}^{i}\), initialized with \(v_{0}^{i}=1\). Push-sum performs the following iterations:

$$v_{k+1}^{i}=\sum_{j\in\mathcal{N}_{i}^{\text{in}}}b_{ij}v_{k}^{j},\qquad(4\text{a})$$
$$\mathbf{x}_{k+1}^{i}=\sum_{j\in\mathcal{N}_{i}^{\text{in}}}b_{ij}\mathbf{x}_{k}^{j},\qquad(4\text{b})$$
$$\mathbf{z}_{k+1}^{i}=\frac{\mathbf{x}_{k+1}^{i}}{v_{k+1}^{i}},\qquad(4\text{c})$$

where \(\underline{B}=\{b_{ij}\}\) is column-stochastic. The iterations in Eqs. (4a) and (4b) align with the right Perron eigenvector of \(\underline{B}\); this eigenvector is not **1**_{n} because \(\underline {B}\) is not row-stochastic, and we denote it by **π**_{c}. In fact, it can be verified that \({\lim }_{k\rightarrow \infty }v_{k}^{i}=n[\!\boldsymbol {\pi }_{c}]_{i}\) and that \({\lim }_{k\rightarrow \infty }\mathbf {x}_{k}^{i}=[\!\boldsymbol {\pi }_{c}]_{i}\sum _{i=1}^{n}\mathbf {x}_{0}^{i}\). Therefore, the limit of \(\mathbf{z}_{k}^{i}\), as the ratio of \(\mathbf{x}_{k}^{i}\) over \(v_{k}^{i}\), is the average of the initial values:

$$\lim_{k\rightarrow\infty}\mathbf{z}_{k}^{i}=\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_{0}^{i}.$$

In the next subsection, we present subgradient-push, which applies push-sum to DGD; see [32, 33] for an alternate approach that does not require the eigenvector estimation of Eq. (4a).
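A minimal numerical sketch of Eqs. (4a)–(4c), on an assumed 3-node strongly connected digraph with uniform column-stochastic weights \(b_{ij}=1/|\mathcal{N}_j^{\text{out}}|\) (the graph and initial values are illustrative):

```python
import numpy as np

# Digraph with edges 1->2, 2->3, 3->1, 1->3 plus self-loops; column j of B
# spreads 1/|N_j^out| to each out-neighbor, so every column sums to 1.
B = np.array([[1/3, 0.0, 1/2],
              [1/3, 1/2, 0.0],
              [1/3, 1/2, 1/2]])

x = np.array([1.0, 2.0, 6.0])   # initial values; their average is 3
v = np.ones(3)                  # v_0^i = 1
for _ in range(200):
    v = B @ v                   # Eq. (4a): eigenvector estimation
    x = B @ x                   # Eq. (4b)
z = x / v                       # Eq. (4c): the ratio cancels the pi_c imbalance
```

Column stochasticity preserves the sums \(\sum_i x_k^i\) and \(\sum_i v_k^i=n\) while both sequences align with **π**_{c}, so the ratio recovers the exact average of the initial values.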

#### 3.2.2 Subgradient-push

Subgradient-push performs the following iterations:

$$v_{k+1}^{i}=\sum_{j\in\mathcal{N}_{i}^{\text{in}}}b_{ij}v_{k}^{j},\qquad(5\text{a})$$
$$\mathbf{x}_{k+1}^{i}=\sum_{j\in\mathcal{N}_{i}^{\text{in}}}b_{ij}\mathbf{x}_{k}^{j}-\alpha_{k}\nabla f_{i}\left(\mathbf{z}_{k}^{i}\right),\qquad(5\text{b})$$
$$\mathbf{z}_{k+1}^{i}=\frac{\mathbf{x}_{k+1}^{i}}{v_{k+1}^{i}},\qquad(5\text{c})$$

initialized with \(v_{0}^{i}=1\) and an arbitrary \(\mathbf {x}_{0}^{i}\) at each agent. The step-size, *α*_{k}, satisfies the same conditions as in DGD. To understand these iterations, note that Eqs. (5a)–(5c) are nearly the same as Eqs. (4a)–(4c), except that there is an additional gradient term in Eq. (5b), which drives the limit of \(\mathbf {z}_{k}^{i}\) to **x**^{∗}. Under the Assumptions 2, 3, and 6, subgradient-push converges to **x**^{∗} at the rate of \(\mathcal {O}\left (\frac {\log k}{\sqrt {k}}\right)\). For extensions of subgradient-push to asynchronous networks, see recent work [34, 35, 36]. We next describe an algorithm that significantly improves this convergence rate.

#### 3.2.3 ADD-OPT/Push-DIGing

ADD-OPT/Push-DIGing adds gradient tracking to push-sum and converges linearly to **x**^{∗} under the Assumptions 2, 4, and 6, in contrast to the sublinear convergence of subgradient-push. The three vectors, \(\mathbf {x}^{i}_{k}\), \(\mathbf {z}^{i}_{k}\), and \(\mathbf {y}^{i}_{k}\), and a scalar \(v^{i}_{k}\), maintained at each agent *i*, are updated as follows:

$$v_{k+1}^{i}=\sum_{j\in\mathcal{N}_{i}^{\text{in}}}b_{ij}v_{k}^{j},\qquad(6\text{a})$$
$$\mathbf{x}_{k+1}^{i}=\sum_{j\in\mathcal{N}_{i}^{\text{in}}}b_{ij}\mathbf{x}_{k}^{j}-\alpha\,\mathbf{y}_{k}^{i},\qquad(6\text{b})$$
$$\mathbf{z}_{k+1}^{i}=\frac{\mathbf{x}_{k+1}^{i}}{v_{k+1}^{i}},\qquad(6\text{c})$$
$$\mathbf{y}_{k+1}^{i}=\sum_{j\in\mathcal{N}_{i}^{\text{in}}}b_{ij}\mathbf{y}_{k}^{j}+\nabla f_{i}\left(\mathbf{z}_{k+1}^{i}\right)-\nabla f_{i}\left(\mathbf{z}_{k}^{i}\right),\qquad(6\text{d})$$

where each agent is initialized with \(v_{0}^{i}=1\), \(\mathbf {y}_{0}^{i}=\nabla f_{i}\left (\mathbf {x}_{0}^{i}\right)\), and an arbitrary \(\mathbf {x}_{0}^{i}\). We note here that ADD-OPT/Push-DIGing essentially applies push-sum to the algorithm in Eqs. (2a)–(2b), where the doubly stochastic weights therein are replaced by column-stochastic weights.
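A sketch of Eqs. (6a)–(6d) in an illustrative setting (an assumed 3-node strongly connected digraph with uniform column-stochastic weights, scalar quadratic objectives, and an assumed small constant step-size):

```python
import numpy as np

# Digraph 1->2, 2->3, 3->1, 1->3 plus self-loops; b_ij = 1/|N_j^out|.
B = np.array([[1/3, 0.0, 1/2],
              [1/3, 1/2, 0.0],
              [1/3, 1/2, 1/2]])   # column-stochastic

b = np.array([1.0, 2.0, 6.0])    # hypothetical quadratics f_i(x) = (x - b_i)^2 / 2
x_star = b.mean()

alpha = 0.05
x = np.zeros(3)
v = np.ones(3)
z = x / v
y = z - b                        # y_0^i = grad f_i evaluated at the initial ratio
for _ in range(2000):
    v = B @ v                          # Eq. (6a)
    x = B @ x - alpha * y              # Eq. (6b)
    z_new = x / v                      # Eq. (6c)
    y = B @ y + (z_new - b) - (z - b)  # Eq. (6d): gradient tracking on the ratios
    z = z_new

addopt_error = np.abs(z - x_star).max()
```

The ratio variables \(z_k^i\) converge linearly to **x**^{∗} here; the step-size value and the graph are assumptions for the sketch, not prescriptions from the paper.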

#### 3.2.4 The \(\mathcal {AB}\) algorithm

The \(\mathcal {AB}\) algorithm employs both row- and column-stochastic weights and thus avoids the eigenvector estimation (the division by \(v_{k}^{i}\)) used above^{2}. Each agent *i* maintains two variables: \(\mathbf {x}_{k}^{i}\), \(\mathbf {y}_{k}^{i}\in \mathbb {R}^{p}\), where, as before, \(\mathbf {x}_{k}^{i}\) is the estimate of **x**^{∗} and \(\mathbf {y}_{k}^{i}\) tracks the average gradient, \(\frac {1}{n}{\sum }_{i=1}^{n}\nabla f_{i}\left (\mathbf {x}_{k}^{i}\right)\). The \(\mathcal {AB}\) algorithm, initialized with \(\mathbf {y}_{0}^{i}=\nabla f_{i}\left (\mathbf {x}_{0}^{i}\right)\) and arbitrary \(\mathbf {x}_{0}^{i}\) at each agent, performs the following iterations:

$$\mathbf{x}_{k+1}^{i}=\sum_{j\in\mathcal{N}_{i}^{\text{in}}}a_{ij}\mathbf{x}_{k}^{j}-\alpha\,\mathbf{y}_{k}^{i},\qquad(7\text{a})$$
$$\mathbf{y}_{k+1}^{i}=\sum_{j\in\mathcal{N}_{i}^{\text{in}}}b_{ij}\mathbf{y}_{k}^{j}+\nabla f_{i}\left(\mathbf{x}_{k+1}^{i}\right)-\nabla f_{i}\left(\mathbf{x}_{k}^{i}\right),\qquad(7\text{b})$$

where \(\underline {A}=\{a_{ij}\}\) is row-stochastic and \(\underline {B}=\{b_{ij}\}\) is column-stochastic. It is shown that \(\mathcal {AB}\) converges linearly to **x**^{∗} for sufficiently small step-sizes under the Assumptions 2, 4, and 6 [42]. Therefore, \(\mathcal {AB}\) can be viewed as a generalization of the algorithm in Eqs. (2a)–(2b) as the doubly stochastic weights therein are replaced by row- and column-stochastic weights. Furthermore, it is shown in [43] that ADD-OPT/Push-DIGing in Eqs. (6a)–(6d) in fact can be derived from an equivalent form of \(\mathcal {AB}\) after a state transformation on the **x**_{k}-update; see [43] for details. For applications of the \(\mathcal {AB}\) algorithm to distributed least squares, see, for instance, [56].
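The \(\mathcal{AB}\) updates can be sketched with the same kind of toy problem, assuming a 3-node strongly connected digraph and the uniform weights \(a_{ij}=1/|\mathcal{N}_i^{\text{in}}|\), \(b_{ij}=1/|\mathcal{N}_j^{\text{out}}|\) (all values illustrative):

```python
import numpy as np

# Digraph with edges 1->2, 2->3, 3->1, 1->3 and self-loops.
A = np.array([[1/2, 0.0, 1/2],
              [1/2, 1/2, 0.0],
              [1/3, 1/3, 1/3]])   # row-stochastic:    a_ij = 1/|N_i^in|
B = np.array([[1/3, 0.0, 1/2],
              [1/3, 1/2, 0.0],
              [1/3, 1/2, 1/2]])   # column-stochastic: b_ij = 1/|N_j^out|

b = np.array([1.0, 2.0, 6.0])    # hypothetical quadratics f_i(x) = (x - b_i)^2 / 2
x_star = b.mean()

alpha = 0.05
x = np.zeros(3)
y = x - b                        # y_0^i = grad f_i(x_0^i)
for _ in range(2000):
    x_new = A @ x - alpha * y              # Eq. (7a): row-stochastic mixing + descent
    y = B @ y + (x_new - b) - (x - b)      # Eq. (7b): column-stochastic tracking
    x = x_new

ab_error = np.abs(x - x_star).max()
```

No division or eigenvector estimation appears in the loop; the row-stochastic mixing enforces agreement while the column-stochastic tracking preserves the sum of the gradient estimates.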

## 4 Algorithms using row-stochastic weights

All of the aforementioned methods require at least each agent to know its out-degree in the network in order to construct doubly or column-stochastic weights. This requirement may be infeasible, e.g., when agents use broadcast-based communication protocols. Row-stochastic weights, on the other hand, are easier to implement in a distributed manner as every agent locally assigns an appropriate weight to each incoming variable from its in-neighbors. In the next section, we describe the main contribution of this paper, i.e., a fast optimization algorithm that uses only row-stochastic weights and uncoordinated step-sizes.

Consider DGD in Eq. (1) with row-stochastic weights instead of doubly stochastic weights. As the step-size *α*_{k} goes to 0, it can be verified that the agents achieve agreement. However, this agreement is not on the optimal solution. This can be shown [32] by defining an accumulation state, \(\widehat {\mathbf {x}}_{k}={\sum }_{i=1}^{n}[\!\boldsymbol {\pi }_{r}]_{i}\mathbf {x}_{k}^{i}\), where **π**_{r} is the left Perron eigenvector of the row-stochastic weight matrix, to obtain

$$\widehat{\mathbf{x}}_{k+1}=\widehat{\mathbf{x}}_{k}-\alpha_{k}\sum_{i=1}^{n}\left[\boldsymbol{\pi}_{r}\right]_{i}\nabla f_{i}\left(\mathbf{x}_{k}^{i}\right).\qquad(8)$$

The agents thus descend along a **π**_{r}-weighted, rather than uniform, sum of the local gradients; this *imbalance* in the gradient term is caused by the fact that **π**_{r} is not a vector of all 1’s, a consequence of losing the column-stochasticity in the weight matrix. The modification, introduced in [44], cancels this imbalance and is implemented as follows:

$$\mathbf{y}_{k+1}^{i}=\sum_{j\in\mathcal{N}_{i}^{\text{in}}}a_{ij}\mathbf{y}_{k}^{j},\qquad(9\text{a})$$
$$\mathbf{x}_{k+1}^{i}=\sum_{j\in\mathcal{N}_{i}^{\text{in}}}a_{ij}\mathbf{x}_{k}^{j}-\alpha_{k}\,\frac{\nabla f_{i}\left(\mathbf{x}_{k}^{i}\right)}{\left[\mathbf{y}_{k}^{i}\right]_{i}},\qquad(9\text{b})$$

where \(\underline {A}=\{a_{ij}\}\) is row-stochastic and the algorithm is initialized with \(\mathbf {y}_{0}^{i}=\mathbf {e}_{i}\) and an arbitrary \(\mathbf {x}_{0}^{i}\) at each agent. Equation (9a) asymptotically learns the left Perron eigenvector of the row-stochastic weight matrix \(\underline {A}\), i.e., \({\lim }_{k\rightarrow \infty }\mathbf {y}_{k}^{i}=\boldsymbol {\pi }_{r},\forall i\). The above algorithm achieves a sublinear convergence rate of \(\mathcal {O}\left (\frac {\log k}{\sqrt {k}}\right)\) under the Assumptions 2, 3, and 5, see [44] for details.
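The eigenvector-learning update in Eq. (9a) is easy to verify numerically. Stacking the \(\mathbf{y}_k^i\) as the rows of a matrix with \(Y_0=I_n\), the update is \(Y_{k+1}=\underline{A}\,Y_k\), so \(Y_k=\underline{A}^{k}\rightarrow\mathbf{1}_n\boldsymbol{\pi}_r^{\top}\) and every row approaches **π**_{r}. A sketch with an assumed 3-node strongly connected digraph:

```python
import numpy as np

# Row-stochastic weights a_ij = 1/|N_i^in| for an assumed digraph with
# edges 1->2, 2->3, 3->1, 1->3 and self-loops.
A = np.array([[1/2, 0.0, 1/2],
              [1/2, 1/2, 0.0],
              [1/3, 1/3, 1/3]])

n = 3
Y = np.eye(n)          # y_0^i = e_i, one row per agent
for _ in range(200):
    Y = A @ Y          # Eq. (9a) stacked over all agents

# Independent computation of the left Perron eigenvector pi_r of A.
eigvals, eigvecs = np.linalg.eig(A.T)
idx = int(np.argmin(np.abs(eigvals - 1.0)))
pi_r = np.real(eigvecs[:, idx])
pi_r = pi_r / pi_r.sum()   # normalize so that pi_r^T 1 = 1
```

After a few hundred iterations every agent's row of `Y` matches `pi_r`, which is what makes the division in Eq. (9b) cancel the gradient imbalance.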

### 4.1 FROST (Fast Row-stochastic Optimization with uncoordinated STep-sizes)

Each agent *i* at the *k*th iteration maintains three variables: \(\mathbf {x}_{k}^{i},\mathbf {z}_{k}^{i}\in \mathbb {R}^{p}\), and \(\mathbf {y}_{k}^{i}\in \mathbb {R}^{n}\). At the (*k*+1)th iteration, agent *i* performs the following update:

$$\mathbf{y}_{k+1}^{i}=\sum_{j\in\mathcal{N}_{i}^{\text{in}}}a_{ij}\mathbf{y}_{k}^{j},\qquad(10\text{a})$$
$$\mathbf{x}_{k+1}^{i}=\sum_{j\in\mathcal{N}_{i}^{\text{in}}}a_{ij}\mathbf{x}_{k}^{j}-\alpha_{i}\,\mathbf{z}_{k}^{i},\qquad(10\text{b})$$
$$\mathbf{z}_{k+1}^{i}=\sum_{j\in\mathcal{N}_{i}^{\text{in}}}a_{ij}\mathbf{z}_{k}^{j}+\frac{\nabla f_{i}\left(\mathbf{x}_{k+1}^{i}\right)}{\left[\mathbf{y}_{k+1}^{i}\right]_{i}}-\frac{\nabla f_{i}\left(\mathbf{x}_{k}^{i}\right)}{\left[\mathbf{y}_{k}^{i}\right]_{i}},\qquad(10\text{c})$$

where the *α*_{i}’s are the uncoordinated step-sizes locally chosen at each agent and the row-stochastic weights, \(\underline {A}=\left \{a_{ij}\right \}\), respect the graph topology such that:

$$a_{ij}>0 \text{ if } j\in\mathcal{N}_{i}^{\text{in}},\qquad a_{ij}=0 \text{ otherwise},\qquad \sum_{j=1}^{n}a_{ij}=1,\ \forall i.$$

The algorithm is initialized with an arbitrary \(\mathbf {x}_{0}^{i}\), \(\mathbf {y}_{0}^{i}=\mathbf {e}_{i}\), and \(\mathbf {z}_{0}^{i}=\nabla f_{i}\left (\mathbf {x}_{0}^{i}\right)\). We point out that the initial condition for Eq. (10a) and the divisions in Eq. (10c) require each agent to have a unique identifier. Clearly, Assumption 5 is applicable here. Note that Eq. (10c) is a modified gradient tracking update, first applied to optimization with row-stochastic weights in [45], where the divisions are used to eliminate the imbalance caused by the left Perron eigenvector of the (row-stochastic) weight matrix \(\underline {A}\). We note that the algorithm in [45] requires identical step-sizes at the agents and thus is a special case of Eqs. (10a)–(10c).

We now write FROST in a compact vector-matrix form. Let **x**_{k}, **z**_{k}, and ∇**f**(**x**_{k}) collect the local variables \(\mathbf {x}_{k}^{i}\), \(\mathbf {z}_{k}^{i}\), and \(\nabla f_{i}\left (\mathbf {x}_{k}^{i}\right)\) in a vector in \(\mathbb {R}^{np}\), respectively, and define

$$\underline{Y}_{k}=\left[\mathbf{y}_{k}^{1},\cdots,\mathbf{y}_{k}^{n}\right]^{\top},\quad Y_{k}=\underline{Y}_{k}\otimes I_{p},\quad \widetilde{Y}_{k}=\text{diag}\left(Y_{k}\right),\quad A=\underline{A}\otimes I_{p},\quad D=\text{diag}\{\alpha_{1},\cdots,\alpha_{n}\}\otimes I_{p},$$

for all *k*. Based on the notation above, Eqs. (10a)–(10c) can be written compactly as follows:

$$\underline{Y}_{k+1}=\underline{A}\,\underline{Y}_{k},\qquad(11\text{a})$$
$$\mathbf{x}_{k+1}=A\mathbf{x}_{k}-D\mathbf{z}_{k},\qquad(11\text{b})$$
$$\mathbf{z}_{k+1}=A\mathbf{z}_{k}+\widetilde{Y}_{k+1}^{-1}\nabla\mathbf{f}\left(\mathbf{x}_{k+1}\right)-\widetilde{Y}_{k}^{-1}\nabla\mathbf{f}\left(\mathbf{x}_{k}\right),\qquad(11\text{c})$$

where \(\underline {Y}_{0}=I_{n}\), **z**_{0}=∇**f**_{0}, and **x**_{0} is arbitrary. We emphasize that the implementation of FROST needs no knowledge of agent’s out-degree anywhere in the network in contrast to the earlier related work in [30, 31, 32, 33, 37, 39, 40, 42, 43]. Note that Refs. [22, 23] also use row-stochastic weights but require an additional locally balanced assumption and are only applicable to undirected graphs.
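Putting Eqs. (10a)–(10c) together, the following is a minimal FROST sketch, assuming scalar quadratics \(f_i(x)=(x-b_i)^2/2\) and an assumed 3-node strongly connected digraph with weights \(a_{ij}=1/|\mathcal{N}_i^{\text{in}}|\); to illustrate the uncoordinated step-sizes, only one agent uses a positive step and the others use zero:

```python
import numpy as np

A = np.array([[1/2, 0.0, 1/2],
              [1/2, 1/2, 0.0],
              [1/3, 1/3, 1/3]])   # row-stochastic, strongly connected
n = 3

b = np.array([1.0, 2.0, 6.0])    # hypothetical f_i(x) = (x - b_i)^2 / 2
x_star = b.mean()

alpha = np.array([0.03, 0.0, 0.0])   # uncoordinated: only agent 1 descends

x = np.zeros(n)
Y = np.eye(n)                    # y_0^i = e_i (rows of Y)
z = x - b                        # z_0^i = grad f_i(x_0^i)
g_old = (x - b) / np.diag(Y)     # local gradients rescaled by [y_k^i]_i
for _ in range(3000):
    x_new = A @ x - alpha * z            # Eq. (10b)
    Y = A @ Y                            # Eq. (10a)
    g_new = (x_new - b) / np.diag(Y)     # divide by [y_{k+1}^i]_i
    z = A @ z + g_new - g_old            # Eq. (10c)
    x, g_old = x_new, g_new

frost_error = np.abs(x - x_star).max()
```

Every agent only needs its incoming weights and its own step-size; the graph, objectives, and step-size values here are assumptions for the sketch, chosen small enough that the single positive step-size satisfies the "sufficiently small" requirement.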

## 5 Convergence analysis

Since \(\underline {A}\) is primitive and row-stochastic, from the Perron-Frobenius theorem [49], we note that \(Y_{\infty } = \left (\mathbf {1}_{n}\boldsymbol {\pi }_{r}^{\top }\right)\otimes I_{p}\), where \(\boldsymbol {\pi }_{r}^{\top }\) is the left Perron eigenvector of \(\underline {A}\).

### 5.1 Auxiliary relations

We now start the convergence analysis with a key lemma regarding the contraction of the augmented weight matrix *A* under an arbitrary norm.

### **Lemma 1**

There exists a vector norm, ∥·∥, such that \(\forall \mathbf{a}\in\mathbb{R}^{np}\),

$$\left\|A\mathbf{a}-Y_{\infty}\mathbf{a}\right\|\leq\sigma\left\|\mathbf{a}-Y_{\infty}\mathbf{a}\right\|,$$

where 0<*σ*<1 is some constant.

### *Proof*

By the Perron-Frobenius theorem, *A**Y*_{∞}=*Y*_{∞} and *Y*_{∞}*Y*_{∞}=*Y*_{∞}, which leads to the following relation:

$$A\mathbf{a}-Y_{\infty}\mathbf{a}=\left(A-Y_{\infty}\right)\left(\mathbf{a}-Y_{\infty}\mathbf{a}\right).$$

Since \(\rho\left(A-Y_{\infty}\right)<1\), there exists a matrix norm with ∥*A*−*Y*_{∞}∥<1 and a compatible vector norm, ∥·∥, see Chapter 5 in [49], such that

$$\left\|A\mathbf{a}-Y_{\infty}\mathbf{a}\right\|\leq\left\|A-Y_{\infty}\right\|\left\|\mathbf{a}-Y_{\infty}\mathbf{a}\right\|,$$

and the lemma follows with *σ*=∥*A*−*Y*_{∞}∥. □

As shown above, the existence of a norm in which the consensus process with row-stochastic matrix \(\underline {A}\) is a contraction does not follow the standard 2-norm argument for doubly stochastic matrices [28, 40]. The ensuing arguments built on this notion of contraction under arbitrary norms were first introduced in [39] for column-stochastic weights and in [45] for row-stochastic weights; these arguments are harmonized later to hold simultaneously for both row- and column-stochastic weights in [42, 43]. The next lemma, a direct consequence of the contraction introduced in Lemma 1, is a standard result from consensus and Markov chain theory [57].
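The spectral argument can be checked numerically: for a primitive row-stochastic \(\underline{A}\), the matrix \(\underline{A}-\mathbf{1}_n\boldsymbol{\pi}_r^{\top}\) has spectral radius strictly less than 1, even when the 2-norm does not certify the contraction. A sketch with an assumed 3-node row-stochastic weight matrix:

```python
import numpy as np

A = np.array([[1/2, 0.0, 1/2],
              [1/2, 1/2, 0.0],
              [1/3, 1/3, 1/3]])   # primitive, row-stochastic

# Left Perron eigenvector, normalized so that pi_r^T 1 = 1.
eigvals, eigvecs = np.linalg.eig(A.T)
idx = int(np.argmin(np.abs(eigvals - 1.0)))
pi_r = np.real(eigvecs[:, idx])
pi_r = pi_r / pi_r.sum()

Y_inf = np.outer(np.ones(3), pi_r)                  # 1_n pi_r^T
sigma = np.abs(np.linalg.eigvals(A - Y_inf)).max()  # rho(A - Y_inf)
```

Since \(\rho(A-Y_\infty)<1\), Chapter 5 of [49] guarantees some matrix norm with \(\|A-Y_\infty\|<1\), which is the *σ* of Lemma 1; the identities \(AY_\infty=Y_\infty\) and \(Y_\infty A=Y_\infty\) used in its proof also hold numerically.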

### **Lemma 2**

Consider the matrix sequence, \(Y_{k}\), generated from the weight matrix \(\underline {A}\). We have:

$$\left\|Y_{k}-Y_{\infty}\right\|_{2}\leq r\sigma^{k},\qquad\forall k,$$

where *r* is some positive constant and *σ* is the contraction factor defined in Lemma 1.

### *Proof*

The proof follows from the fact that all matrix norms are equivalent. □

As a consequence of Lemma 2, we next establish the linear convergence of the sequences \(\left \{ \widetilde {Y}_{k}^{-1}\right \}\) and \(\left \{ \widetilde {Y}_{k+1}^{-1}-\widetilde {Y}_{k}^{-1}\right \}\).

### **Lemma 3**

The following inequalities hold \(\forall k: (a) \left \|\widetilde {Y}_{k}^{-1}-\widetilde {Y}_{\infty }^{-1}\right \|_{2}\leq \sqrt {n}r\widetilde {y}^{2}\sigma ^{k}\);

\((b) \left \|\widetilde {Y}_{k+1}^{-1}-\widetilde {Y}_{k}^{-1}\right \|_{2}\leq 2\sqrt {n}r\widetilde {y}^{2}\sigma ^{k}\).

### *Proof*

Note that \(\widetilde {Y}_{k}^{-1}-\widetilde {Y}_{\infty }^{-1}=\widetilde {Y}_{k}^{-1}\left (\widetilde {Y}_{\infty }-\widetilde {Y}_{k}\right)\widetilde {Y}_{\infty }^{-1}\); part (a) then follows from Lemma 2. Part (b) follows from the triangle inequality, since

$$\left\|\widetilde{Y}_{k+1}^{-1}-\widetilde{Y}_{k}^{-1}\right\|_{2}\leq\left\|\widetilde{Y}_{k+1}^{-1}-\widetilde{Y}_{\infty}^{-1}\right\|_{2}+\left\|\widetilde{Y}_{\infty}^{-1}-\widetilde{Y}_{k}^{-1}\right\|_{2}\leq\sqrt{n}r\widetilde{y}^{2}\left(\sigma^{k+1}+\sigma^{k}\right)\leq 2\sqrt{n}r\widetilde{y}^{2}\sigma^{k},$$

which completes the proof. □

The next lemma presents the dynamics that govern the evolution of the weighted sum of **z**_{k}; recall that **z**_{k}, in Eq. (11c), asymptotically tracks the average of local gradients, \(\frac {1}{n}\sum _{i=1}^{n}\nabla f_{i}\left (\mathbf {x}_{k}^{i}\right)\).

### **Lemma 4**

The following equation holds for all *k*:

$$Y_{\infty}\mathbf{z}_{k}=Y_{\infty}\widetilde{Y}_{k}^{-1}\nabla\mathbf{f}\left(\mathbf{x}_{k}\right).$$

### *Proof*

Recall that *Y*_{∞}*A*=*Y*_{∞}. We obtain from Eq. (11c) that

$$Y_{\infty}\mathbf{z}_{k+1}=Y_{\infty}\mathbf{z}_{k}+Y_{\infty}\widetilde{Y}_{k+1}^{-1}\nabla\mathbf{f}\left(\mathbf{x}_{k+1}\right)-Y_{\infty}\widetilde{Y}_{k}^{-1}\nabla\mathbf{f}\left(\mathbf{x}_{k}\right).$$

Summing this relation over the iterations (the intermediate terms telescope) yields

$$Y_{\infty}\mathbf{z}_{k}=Y_{\infty}\mathbf{z}_{0}+Y_{\infty}\widetilde{Y}_{k}^{-1}\nabla\mathbf{f}\left(\mathbf{x}_{k}\right)-Y_{\infty}\widetilde{Y}_{0}^{-1}\nabla\mathbf{f}\left(\mathbf{x}_{0}\right).$$

With the initial conditions that **z**_{0}=∇**f**(**x**_{0}) and \(\widetilde {Y}_{0}=I_{np}\), we complete the proof. □

The next lemma, a standard result in convex optimization theory from [58], states that the distance to the optimal solution contracts in each step in the centralized gradient method.

### **Lemma 5**

Let *μ* and *l* be the strong convexity and Lipschitz continuity constants for the global objective function, *F*(**x**), respectively. Then \(\forall \mathbf {x}\in \mathbb {R}^{p}\) and \(0<\alpha <\frac {2}{l}\), we have

$$\left\|\mathbf{x}-\alpha\nabla F(\mathbf{x})-\mathbf{x}^{*}\right\|_{2}\leq\sigma_{F}\left\|\mathbf{x}-\mathbf{x}^{*}\right\|_{2},$$

where *σ*_{F}= max(|1−*αμ*|,|1−*αl*|).
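Lemma 5 is tight for quadratics whose Hessian has eigenvalues *μ* and *l*, which makes it easy to check numerically. A sketch with assumed constants \(\mu=0.5\), \(l=2\), taking \(F(\mathbf{x})=\frac{1}{2}\mathbf{x}^{\top}H\mathbf{x}\) with \(H=\text{diag}(\mu,l)\), so that **x**^{∗}=**0**:

```python
import numpy as np

mu, l = 0.5, 2.0                 # strong-convexity and Lipschitz constants
H = np.diag([mu, l])             # F(x) = x^T H x / 2, minimizer x* = 0
alpha = 0.8                      # satisfies 0 < alpha < 2/l = 1

sigma_F = max(abs(1 - alpha * mu), abs(1 - alpha * l))

rng = np.random.default_rng(0)
contracts = True
for _ in range(100):
    x = rng.standard_normal(2)
    step = x - alpha * (H @ x)   # one centralized gradient step
    contracts &= np.linalg.norm(step) <= sigma_F * np.linalg.norm(x) + 1e-12
```

Here a single gradient step shrinks the distance to **x**^{∗} by exactly the factor \(\sigma_F=\max(|1-\alpha\mu|,|1-\alpha l|)\) on the extreme eigendirections; for general smooth, strongly convex *F* the same factor is an upper bound.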

With the help of the previous lemmas, we are ready to derive a crucial contraction relationship in the proposed algorithm.

### 5.2 Contraction relationship

Our strategy to show convergence is to bound ∥**x**_{k+1}−*Y*_{∞}**x**_{k+1}∥, ∥*Y*_{∞}**x**_{k+1}−**1**_{n}⊗**x**^{∗}∥_{2}, and ∥**z**_{k+1}−*Y*_{∞}**z**_{k+1}∥ as a linear function of their values in the last iteration and ∇**f**(**x**_{k}); this approach extends the work in [28] on doubly stochastic weights to row-stochastic weights. We will present this relationship in the next lemmas. Before we proceed, we note that since all vector norms are equivalent in \(\mathbb {R}^{np}\), there exist positive constants *c*,*d* such that: ∥·∥_{2}≤*c*∥·∥,∥·∥≤*d*∥·∥_{2}. First, we derive a bound for ∥**x**_{k+1}−*Y*_{∞}**x**_{k+1}∥, the consensus error of the agents.

### **Lemma 6**

The following inequality holds, ∀*k*:

$$\left\|\mathbf{x}_{k+1}-Y_{\infty}\mathbf{x}_{k+1}\right\|\leq\sigma\left\|\mathbf{x}_{k}-Y_{\infty}\mathbf{x}_{k}\right\|+\overline{\alpha}\,d\left\|I_{np}-Y_{\infty}\right\|_{2}\left\|\mathbf{z}_{k}\right\|_{2},$$

where *d* is the equivalence-norm constant such that ∥·∥≤*d*∥·∥_{2} and \(\overline {\alpha }\) is the largest step-size among the agents.

### *Proof*

Note that *Y*_{∞}*A*=*Y*_{∞}. Using Eq. (11b) and Lemma 1, we have:

$$\begin{aligned}\left\|\mathbf{x}_{k+1}-Y_{\infty}\mathbf{x}_{k+1}\right\|&=\left\|A\mathbf{x}_{k}-D\mathbf{z}_{k}-Y_{\infty}\left(A\mathbf{x}_{k}-D\mathbf{z}_{k}\right)\right\|\\&\leq\left\|A\mathbf{x}_{k}-Y_{\infty}\mathbf{x}_{k}\right\|+\left\|\left(I_{np}-Y_{\infty}\right)D\mathbf{z}_{k}\right\|\\&\leq\sigma\left\|\mathbf{x}_{k}-Y_{\infty}\mathbf{x}_{k}\right\|+\overline{\alpha}\,d\left\|I_{np}-Y_{\infty}\right\|_{2}\left\|\mathbf{z}_{k}\right\|_{2},\end{aligned}$$

which completes the proof. □

Next, we derive a bound for ∥*Y*_{∞}**x**_{k+1}−**1**_{n}⊗**x**^{∗}∥_{2}, i.e., the optimality gap between the accumulation state of the network, *Y*_{∞}**x**_{k+1}, and the optimal solution, **1**_{n}⊗**x**^{∗}.

### **Lemma 7**

The following inequality holds, ∀*k*:

where \(\lambda =\max \left (\left |1-n\boldsymbol {\pi }_{r}^{\top }\boldsymbol {\alpha }\mu \right |,\left |1-n\boldsymbol {\pi }_{r}^{\top }\boldsymbol {\alpha }l \right |\right)\) and *c* is the equivalence-norm constant such that ∥·∥_{2}≤*c*∥·∥.

### *Proof*

Since *Y*_{∞}*A*=*Y*_{∞}, we have the following:

where \(\lambda =\max \left (\left |1-n\boldsymbol {\pi }_{r}^{\top }\boldsymbol {\alpha }\mu \right |,\left |1-n\boldsymbol {\pi }_{r}^{\top }\boldsymbol {\alpha }l \right |\right)\).

We bound the remaining terms, *s*_{2} and *s*_{3}, as

where we use Lemma 3. Combining Eqs. (15)–(20), we finish the proof. □

Next, we bound ∥**z**_{k+1}−*Y*_{∞}**z**_{k+1}∥, the error in gradient estimation.

### **Lemma 8**

The following inequality holds, ∀*k*:

### *Proof*

We first derive a bound for ∥**x**_{k+1}−**x**_{k}∥_{2} using Eq. (11b):

where in the second inequality, we use the fact that (*A*−*I*_{np})*Y*_{∞} is a zero matrix. Combining Eqs. (21)–(23), we obtain the desired result. □

The last step is to bound ∥**z**_{k}∥_{2} in terms of ∥**x**_{k}−*Y*_{∞}**x**_{k}∥, ∥*Y*_{∞}**x**_{k}−**1**_{n}⊗**x**^{∗}∥_{2}, and ∥**z**_{k}−*Y*_{∞}**z**_{k}∥. Then, we can replace ∥**z**_{k}∥_{2} in Lemmas 6 and 8 by this bound in order to develop an LTI system inequality.

### **Lemma 9**

The following inequality holds, ∀*k*:

### *Proof*

where in the second inequality, we use the fact that \(\left (\mathbf {1}_{n}^{\top }\otimes I_{p}\right)\nabla \mathbf {f}(\mathbf {x}^{*})=0\), which is the optimality condition for Problem P1. □

Before the main result, we present an additional lemma from nonnegative matrix theory that will be helpful in establishing the linear convergence of FROST.

### **Lemma 10**

(Theorem 8.1.29 in [49]) Let \(X\in \mathbb {R}^{n\times n}\) be a nonnegative matrix and \(\mathbf {x}\in \mathbb {R}^{n}\) be a positive vector. If *X***x**<*ω***x**, then *ρ*(*X*)<*ω*.

### 5.3 Main results

With the help of the auxiliary relationships developed in the previous subsection, we now present the main results as follows in Theorems 1 and 2. Theorem 1 states that the relationships derived in the previous subsection indeed provide a contraction when the largest step-size, \(\overline {\alpha }\), is sufficiently small. Theorem 2 then establishes the linear convergence of FROST.

### **Theorem 1**

Let Assumptions 2, 4, and 5 hold, where the *a*_{i}’s are the positive constants arising from the bounds in the preceding lemmas, and let [**π**_{r}]_{−} be the smallest element in **π**_{r}. When the largest step-size, \(\overline {\alpha }\), satisfies the bound in Eq. (27), for some positive constants *δ*_{1}, *δ*_{2}, *δ*_{3} such that

$$0<\delta_{1}<\frac{(1-\sigma)\delta_{3}}{a_{6}},\qquad\delta_{2}>\frac{a_{4}\delta_{1}+a_{5}\delta_{3}}{\mu n\left[\boldsymbol{\pi}_{r}\right]_{-}},\qquad(28)$$

then the spectral radius of *J*_{α} is strictly less than 1.

### *Proof*

Recall that *μ*≤*l* [59]. In order to make \(\boldsymbol {\pi }_{r}^{\top }\boldsymbol {\alpha }<\frac {1}{nl}\) hold, it suffices to require \(\overline {\alpha }<\frac {1}{nl}\). The next step is to find an upper bound, \(\hat {\alpha }\), on the largest step-size such that *ρ*(*J*_{α})<1 when \(\overline {\alpha }<\hat {\alpha }\). In light of Lemma 10, we solve for the range of the largest step-size, \(\overline {\alpha }\), and a positive vector **δ**=[*δ*_{1},*δ*_{2},*δ*_{3}]^{⊤} from the following:

To find the range of *δ*_{2} such that the second inequality holds, it suffices to solve for the range of *δ*_{2} such that the following inequality holds:

$$\mu n\left[\boldsymbol{\pi}_{r}\right]_{-}\delta_{2}>a_{4}\delta_{1}+a_{5}\delta_{3},$$

where \(\left[\boldsymbol{\pi}_{r}\right]_{-}\) is the smallest entry in **π**_{r}. Therefore, *ρ*(*J*_{α})<1 as long as the largest step-size, \(\overline{\alpha}\), follows the bound in Eq. (27),

where the range of *δ*_{1} and *δ*_{2} is given in Eqs. (31) and (32), respectively, and *δ*_{3} is an arbitrary positive constant and the theorem follows. □

Note that *δ*_{1},*δ*_{2},*δ*_{3} are essentially adjustable parameters that are chosen independently from the step-sizes. Specifically, according to Eq. (28), we first choose an arbitrary positive constant *δ*_{3} and subsequently choose a constant *δ*_{1} such that \(0< \delta _{1} < \frac {(1-\sigma)\delta _{3}}{a_{6}}\) and finally we choose a constant *δ*_{2} such that \(\delta _{2}>\frac {a_{4}\delta _{1}+a_{5}\delta _{3}}{\mu n[\boldsymbol {\pi }_{r}]_{-}}\).

### **Theorem 2**

Let Assumptions 2, 4, and 5 hold and let the largest step-size, \(\overline{\alpha}\), follow the bound in Eq. (27). Then the sequence \(\left\{\mathbf{x}_{k}\right\}\) generated by FROST converges linearly to the optimal solution, i.e.,

$$\left\|\mathbf{x}_{k}-\mathbf{1}_{n}\otimes\mathbf{x}^{*}\right\|_{2}\leq m\left(\max\left\{\rho\left(J_{\alpha}\right),\sigma\right\}+\xi\right)^{k},\qquad\forall k,$$

where *ξ* is an arbitrarily small constant, *σ* is the contraction factor defined in Lemma 1, and *m* is some positive constant.

Noticing that *ρ*(*J*_{α})<1 when the largest step-size, \(\overline {\alpha }\), follows the bound in Eq. (27) and that *H*_{k} linearly decays at the rate of *σ*^{k}, one can intuitively verify Theorem 2. A rigorous proof follows from [45].

In Theorems 1 and 2, we establish the linear convergence of FROST when the largest step-size, \(\overline {\alpha }\), follows the upper bound defined in Eq. (27). Distributed optimization (based on gradient tracking) with uncoordinated step-sizes has been previously studied in [26, 46, 47], over undirected graphs with doubly stochastic weights, and in [48], over directed graphs with column-stochastic weights. These works rely on some notion of heterogeneity of the step-sizes, defined respectively as the relative deviation of the step-sizes from their average, \(\frac {\|(I_{n}-U)\boldsymbol {\alpha }\|_{2}}{\|U\boldsymbol {\alpha }\|_{2}}\), where \(U=\mathbf {1}_{n}\mathbf {1}_{n}^{\top }/n\), in [26, 46], and as the ratio of the largest to the smallest step-size, \(\frac {\max _{i}\{\alpha _{i}\}}{\min _{i}\{\alpha _{i}\}}\), in [47, 48]. The authors then show that when the heterogeneity is small enough and when the largest step-size follows a bound that is a function of the heterogeneity, the proposed algorithms converge to the optimal solution. It is worth noting that sufficiently small step-sizes cannot guarantee sufficiently small heterogeneity in both of the aforementioned definitions. In contrast, the upper bound on the largest step-size in this paper, Eqs. (27) and (28), is independent of any notion of heterogeneity and only depends on the objective functions and the network parameters^{3}. Each agent therefore locally picks a sufficiently small step-size independent of the other step-sizes. Moreover, this bound allows the agents to choose a zero step-size as long as at least one of them is positive and sufficiently small.

## 6 Numerical results

Each agent *i* has access to *m*_{i} training samples, \((\mathbf {c}_{ij},y_{ij})\in \mathbb {R}^{p}\times \{-1,+1\}\), where **c**_{ij} contains the *p* features of the *j*th training sample at agent *i* and *y*_{ij} is the corresponding binary label. The network of agents cooperatively solves the following distributed logistic regression problem:

$$\min_{\mathbf{x}\in\mathbb{R}^{p}}~F(\mathbf{x})=\frac{\lambda}{2}\|\mathbf{x}\|_{2}^{2}+\sum_{i=1}^{n}\sum_{j=1}^{m_{i}}\ln\left[1+\exp\left(-y_{ij}\mathbf{c}_{ij}^{\top}\mathbf{x}\right)\right],$$

where *λ*>0 is a regularization parameter that makes the objective strongly convex. The feature vectors, **c**_{ij}’s, are randomly generated from a zero-mean Gaussian distribution, and the binary labels are randomly generated from a Bernoulli distribution. The network topology is shown in Fig. 1. We adopt a simple uniform weighting strategy to construct the row- and column-stochastic weights when needed: \(a_{ij}=1/|\mathcal {N}_{i}^{{ \text {in}}}|,~b_{ij}=1/|\mathcal {N}_{j}^{{ \text {out}}}|,~\forall i,j\). We plot the average of the residuals at the agents, \(\frac {1}{n}\sum _{i=1}^{n}\|\mathbf {x}_{i}(k)-\mathbf {x}^{*}\|_{2}\). In Fig. 2 (left), each curve shows the linear convergence of FROST when the corresponding agent uses a positive, manually optimized step-size while every other agent uses a zero step-size.
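The uniform weighting strategy can be written down directly from the graph: each agent *i* assigns \(1/|\mathcal {N}_{i}^{\text {in}}|\) to each of its in-neighbors (yielding a row-stochastic *A*), and each agent *j* assigns \(1/|\mathcal {N}_{j}^{\text {out}}|\) to each of its out-neighbors when column-stochastic weights are needed. A sketch over a small directed graph (the adjacency matrix below is illustrative; self-loops are included, as is standard):

```python
import numpy as np

# adj[i, j] = 1 if j -> i is an edge, i.e., j is an in-neighbor of i.
# Illustrative 5-node strongly connected digraph with self-loops.
adj = np.array([[1, 1, 0, 0, 1],
                [1, 1, 0, 0, 0],
                [0, 1, 1, 0, 0],
                [0, 0, 1, 1, 0],
                [0, 0, 0, 1, 1]], dtype=float)

# Row-stochastic weights: a_ij = 1 / |N_i^in|. Agent i normalizes whatever it
# receives, so only in-degree knowledge is needed (broadcast-friendly).
A = adj / adj.sum(axis=1, keepdims=True)

# Column-stochastic weights: b_ij = 1 / |N_j^out|. Agent j must know its
# out-degree, which is the requirement FROST avoids.
B = adj / adj.sum(axis=0, keepdims=True)

assert np.allclose(A.sum(axis=1), 1.0)  # rows of A sum to 1
assert np.allclose(B.sum(axis=0), 1.0)  # columns of B sum to 1
```

Note the asymmetry in the required knowledge: *A* is computable from received messages alone, while *B* requires each sender to count its recipients.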

In Fig. 2 (right), we compare the performance of FROST with ADD-OPT/Push-DIGing [39, 40] (see Section 3.2.3) and with the \(\mathcal {AB}\) algorithm [42, 43] (see Section 3.2.4). The step-size used in each algorithm is manually optimized. For FROST, we first find the optimal identical step-size for all agents, which is 0.07 in our experiment, and then randomly generate the uncoordinated step-sizes of FROST from the uniform distribution over the interval [0,0.07]; the convergence speed of FROST shown in this experiment is therefore conservative. The numerical experiments thus verify our theoretical finding that as long as the largest step-size of FROST is positive and sufficiently small, FROST converges linearly to the optimal solution.
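To make the experiment concrete, the following toy sketch runs the FROST recursion, as we restate it here from the algorithm description earlier in the paper, on scalar quadratics \(f_{i}(x)=(x-b_{i})^{2}/2\), whose global minimizer is the mean of the \(b_{i}\)'s. The graph, step-sizes, and iteration count are our illustrative assumptions (one agent deliberately uses a zero step-size, as the theory permits):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# Directed, strongly connected graph with self-loops: adj[i, j] = 1 iff j is an
# in-neighbor of i. Row-stochastic weights via the uniform strategy.
adj = np.array([[1, 1, 0, 0, 1],
                [1, 1, 0, 0, 0],
                [0, 1, 1, 0, 0],
                [0, 0, 1, 1, 0],
                [0, 0, 0, 1, 1]], dtype=float)
A = adj / adj.sum(axis=1, keepdims=True)

# Local objectives f_i(x) = (x - b_i)^2 / 2, so F is minimized at mean(b).
b = rng.normal(size=n)
grad = lambda x: x - b          # stacked local gradients

# Uncoordinated step-sizes: small, one of them zero; only the largest matters.
alpha = np.array([0.02, 0.0, 0.01, 0.005, 0.015])

# FROST state: Y tracks A^k (Y_0 = I); z is the gradient-tracking variable,
# with local gradients rescaled by the diagonal of Y to undo the row-stochastic
# imbalance.
x = np.zeros(n)
Y = np.eye(n)
z = grad(x) / np.diag(Y)

for _ in range(50000):
    x_new = A @ x - alpha * z
    Y_new = A @ Y
    z = A @ z + grad(x_new) / np.diag(Y_new) - grad(x) / np.diag(Y)
    x, Y = x_new, Y_new

x_star = b.mean()
residual = np.abs(x - x_star).mean()  # average residual across agents
assert residual < 1e-6
```

All agents reach the global minimizer even though agent 1 never takes a gradient step of its own, mirroring the single-positive-step-size curves in Fig. 2 (left).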

We further test FROST over three additional strongly connected directed graphs, \(\mathcal {G}_{1}\), \(\mathcal {G}_{2}\), and \(\mathcal {G}_{3}\), each with *n*=50 nodes, where \(\mathcal {G}_{1}\) has roughly 10% of the total possible edges, \(\mathcal {G}_{2}\) roughly 13%, and \(\mathcal {G}_{3}\) roughly 16%. These graphs are shown in Fig. 3, and the performance of FROST over each of them is shown in Fig. 4.

## 7 Conclusions

In this paper, we consider distributed optimization applicable to both directed and undirected graphs with row-stochastic weights and with uncoordinated step-sizes across the agents. Most existing algorithms are based on column-stochastic weights, which may be infeasible to implement in many practical scenarios. Row-stochastic weights, on the other hand, are straightforward to implement, as each agent locally determines the weights assigned to the information it receives from each of its neighbors. We propose a fast algorithm, termed FROST (Fast Row-stochastic Optimization with uncoordinated STep-sizes), and show that when the largest step-size is positive and sufficiently small, FROST converges linearly to the optimal solution. Simulation results further verify the theoretical analysis.


### Acknowledgments

Not applicable.

### Funding

This work has been partially supported by an NSF Career Award # CCF-1350264.

### Availability of data and materials

Data sharing not applicable to this article as no datasets were generated or analysed during the current study.

### Authors’ contributions

All authors contributed equally to this paper. All authors read and approved the final manuscript.

### Competing interests

The authors declare that they have no competing interests.

### Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## References

- 1. P. A. Forero, A. Cano, G. B. Giannakis, Consensus-based distributed support vector machines. J. Mach. Learn. Res. **11**, 1663–1707 (2010).
- 2. S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. **3**(1), 1–122 (2011). https://doi.org/10.1561/2200000016.
- 3. H. Raja, W. U. Bajwa, Cloud K-SVD: a collaborative dictionary learning algorithm for big, distributed data. IEEE Trans. Signal Process. **64**(1), 173–188 (2016).
- 4. H.-T. Wai, Z. Yang, Z. Wang, M. Hong, Multi-agent reinforcement learning via double averaging primal-dual optimization. arXiv preprint arXiv:1806.00877 (2018).
- 5. P. Di Lorenzo, A. H. Sayed, Sparse distributed learning based on diffusion adaptation. IEEE Trans. Signal Process. **61**(6), 1419–1433 (2013).
- 6. S. Scardapane, R. Fierimonte, P. Di Lorenzo, M. Panella, A. Uncini, Distributed semi-supervised support vector machines. Neural Netw. **80**, 43–52 (2016).
- 7. A. Jadbabaie, J. Lin, A. Morse, Coordination of groups of mobile autonomous agents using nearest neighbor rules. IEEE Trans. Autom. Control **48**(6), 988–1001 (2003).
- 8. G. Mateos, J. A. Bazerque, G. B. Giannakis, Distributed sparse linear regression. IEEE Trans. Signal Process. **58**(10), 5262–5276 (2010). https://doi.org/10.1109/TSP.2010.2055862.
- 9. J. A. Bazerque, G. B. Giannakis, Distributed spectrum sensing for cognitive radio networks by exploiting sparsity. IEEE Trans. Signal Process. **58**(3), 1847–1862 (2010). https://doi.org/10.1109/TSP.2009.2038417.
- 10. M. Rabbat, R. Nowak, Distributed optimization in sensor networks, in *3rd International Symposium on Information Processing in Sensor Networks* (IEEE, 2004), pp. 20–27. https://doi.org/10.1109/IPSN.2004.1307319.
- 11. S. Safavi, U. A. Khan, S. Kar, J. M. F. Moura, Distributed localization: a linear theory. Proc. IEEE **106**(7), 1204–1223 (2018). https://doi.org/10.1109/JPROC.2018.2823638.
- 12. J. Tsitsiklis, D. P. Bertsekas, M. Athans, Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE Trans. Autom. Control **31**(9), 803–812 (1986).
- 13. A. Nedić, A. Ozdaglar, Distributed subgradient methods for multi-agent optimization. IEEE Trans. Autom. Control **54**(1), 48–61 (2009). https://doi.org/10.1109/TAC.2008.2009515.
- 14. K. Yuan, Q. Ling, W. Yin, On the convergence of decentralized gradient descent. SIAM J. Optim. **26**(3), 1835–1854 (2016).
- 15. A. S. Berahas, R. Bollapragada, N. S. Keskar, E. Wei, Balancing communication and computation in distributed optimization. arXiv preprint arXiv:1709.02999 (2017).
- 16. H. Terelius, U. Topcu, R. M. Murray, Decentralized multi-agent optimization via dual decomposition. IFAC Proc. Vol. **44**(1), 11245–11251 (2011).
- 17. J. F. C. Mota, J. M. F. Xavier, P. M. Q. Aguiar, M. Püschel, D-ADMM: a communication-efficient distributed algorithm for separable optimization. IEEE Trans. Signal Process. **61**(10), 2718–2723 (2013). https://doi.org/10.1109/TSP.2013.2254478.
- 18. E. Wei, A. Ozdaglar, Distributed alternating direction method of multipliers, in *51st IEEE Annual Conference on Decision and Control* (IEEE, 2012), pp. 5445–5450. https://doi.org/10.1109/CDC.2012.6425904.
- 19. W. Shi, Q. Ling, K. Yuan, G. Wu, W. Yin, On the linear convergence of the ADMM in decentralized consensus optimization. IEEE Trans. Signal Process. **62**(7), 1750–1761 (2014). https://doi.org/10.1109/TSP.2014.2304432.
- 20. D. Jakovetic, J. Xavier, J. M. F. Moura, Fast distributed gradient methods. IEEE Trans. Autom. Control **59**(5), 1131–1146 (2014).
- 21. W. Shi, Q. Ling, G. Wu, W. Yin, EXTRA: an exact first-order algorithm for decentralized consensus optimization. SIAM J. Optim. **25**(2), 944–966 (2015). https://doi.org/10.1137/14096668X.
- 22. K. Yuan, B. Ying, X. Zhao, A. H. Sayed, Exact diffusion for distributed optimization and learning - part I: algorithm development. IEEE Trans. Signal Process. (2018). https://doi.org/10.1109/TSP.2018.2875898.
- 23. K. Yuan, B. Ying, X. Zhao, A. H. Sayed, Exact diffusion for distributed optimization and learning - part II: convergence analysis. IEEE Trans. Signal Process. (2018). https://doi.org/10.1109/TSP.2018.2875883.
- 24. A. H. Sayed, Diffusion adaptation over networks, in *Academic Press Library in Signal Processing, vol. 3* (Elsevier, Amsterdam, 2014), pp. 323–453.
- 25. M. Zhu, S. Martínez, Discrete-time dynamic average consensus. Automatica **46**(2), 322–329 (2010).
- 26. J. Xu, S. Zhu, Y. C. Soh, L. Xie, Augmented distributed gradient methods for multi-agent optimization under uncoordinated constant stepsizes, in *IEEE 54th Annual Conference on Decision and Control* (IEEE, 2015), pp. 2055–2060.
- 27. P. Di Lorenzo, G. Scutari, Distributed nonconvex optimization over time-varying networks, in *2016 IEEE International Conference on Acoustics, Speech and Signal Processing* (IEEE, 2016), pp. 4124–4128.
- 28. G. Qu, N. Li, Harnessing smoothness to accelerate distributed optimization. IEEE Trans. Control Netw. Syst. **5**(3), 1245–1260 (2018). https://doi.org/10.1109/TCNS.2017.2698261.
- 29. P. Di Lorenzo, G. Scutari, NEXT: in-network nonconvex optimization. IEEE Trans. Signal Inf. Process. Over Netw. **2**(2), 120–136 (2016).
- 30. K. I. Tsianos, S. Lawlor, M. G. Rabbat, Push-sum distributed dual averaging for convex optimization, in *51st IEEE Annual Conference on Decision and Control* (IEEE, 2012), pp. 5453–5458. https://doi.org/10.1109/CDC.2012.6426375.
- 31. A. Nedić, A. Olshevsky, Distributed optimization over time-varying directed graphs. IEEE Trans. Autom. Control **60**(3), 601–615 (2015).
- 32. C. Xi, Q. Wu, U. A. Khan, On the distributed optimization over directed networks. Neurocomputing **267**, 508–515 (2017).
- 33. C. Xi, U. A. Khan, Distributed subgradient projection algorithm over directed graphs. IEEE Trans. Autom. Control **62**(8), 3986–3992 (2016).
- 34. M. Assran, M. Rabbat, Asynchronous subgradient-push. arXiv preprint arXiv:1803.08950 (2018).
- 35. J. Zhang, K. You, AsySPA: an exact asynchronous algorithm for convex optimization over digraphs. arXiv preprint arXiv:1808.04118 (2018).
- 36. A. Olshevsky, I. C. Paschalidis, A. Spiridonoff, Robust asynchronous stochastic gradient-push: asymptotically optimal and network-independent performance for strongly convex functions. arXiv preprint arXiv:1811.03982 (2018).
- 37. C. Xi, U. A. Khan, DEXTRA: a fast algorithm for optimization over directed graphs. IEEE Trans. Autom. Control **62**(10), 4980–4993 (2017).
- 38. D. Kempe, A. Dobra, J. Gehrke, Gossip-based computation of aggregate information, in *44th Annual IEEE Symposium on Foundations of Computer Science* (IEEE, 2003), pp. 482–491. https://doi.org/10.1109/SFCS.2003.1238221.
- 39. C. Xi, R. Xin, U. A. Khan, ADD-OPT: accelerated distributed directed optimization. IEEE Trans. Autom. Control **63**(5), 1329–1339 (2017).
- 40. A. Nedić, A. Olshevsky, W. Shi, Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM J. Optim. **27**(4), 2597–2633 (2017).
- 41. Y. Sun, G. Scutari, D. Palomar, Distributed nonconvex multiagent optimization over time-varying networks, in *2016 50th Asilomar Conference on Signals, Systems and Computers* (IEEE, Pacific Grove, 2016), pp. 788–794.
- 42. R. Xin, U. A. Khan, A linear algorithm for optimization over directed graphs with geometric convergence. IEEE Control Syst. Lett. **2**(3), 325–330 (2018).
- 43. R. Xin, U. A. Khan, Distributed heavy-ball: a generalization and acceleration of first-order methods with gradient tracking. arXiv preprint arXiv:1808.02942 (2018).
- 44. V. S. Mai, E. H. Abed, Distributed optimization over weighted directed graphs using row stochastic matrix, in *2016 American Control Conference (ACC)* (IEEE, 2016), pp. 7165–7170. https://doi.org/10.1109/ACC.2016.7526803.
- 45. C. Xi, V. S. Mai, R. Xin, E. Abed, U. A. Khan, Linear convergence in optimization over directed graphs with row-stochastic matrices. IEEE Trans. Autom. Control (2018), in press.
- 46. J. Xu, S. Zhu, Y. Soh, L. Xie, Convergence of asynchronous distributed gradient methods over stochastic networks. IEEE Trans. Autom. Control **63**(2), 434–448 (2018).
- 47. A. Nedić, A. Olshevsky, W. Shi, C. A. Uribe, Geometrically convergent distributed optimization with uncoordinated step-sizes, in *2017 American Control Conference (ACC)* (IEEE, 2017), pp. 3950–3955. https://doi.org/10.23919/ACC.2017.7963560.
- 48. Q. Lü, H. Li, D. Xia, Geometrical convergence rate for distributed optimization with time-varying directed graphs and uncoordinated step-sizes. Inf. Sci. **422**, 516–530 (2018).
- 49. R. A. Horn, C. R. Johnson, *Matrix Analysis, 2nd ed.* (Cambridge University Press, New York, NY, 2013).
- 50. H. W. Kuhn, The Hungarian method for the assignment problem. Nav. Res. Logist. Q. **2**(1–2), 83–97 (1955).
- 51. S. Safavi, U. A. Khan, On the convergence rate of swap-collide algorithm for simple task assignment, in *48th IEEE Asilomar Conference on Signals, Systems, and Computers* (IEEE, 2014), pp. 1507–1510.
- 52. M. Zhu, S. Martínez, Discrete-time dynamic average consensus. Automatica **46**(2), 322–329 (2010).
- 53. F. Benezit, V. Blondel, P. Thiran, J. Tsitsiklis, M. Vetterli, Weighted gossip: distributed averaging using non-doubly stochastic matrices, in *IEEE International Symposium on Information Theory* (IEEE, 2010), pp. 1753–1757. https://doi.org/10.1109/ISIT.2010.5513273.
- 54. F. Saadatniaki, R. Xin, U. A. Khan, Optimization over time-varying directed graphs with row and column-stochastic matrices. arXiv preprint arXiv:1810.07393 (2018).
- 55. K. Cai, H. Ishii, Average consensus on general strongly connected digraphs. Automatica **48**(11), 2750–2761 (2012). https://doi.org/10.1016/j.automatica.2012.08.003.
- 56. T. Yang, J. George, J. Qin, X. Yi, J. Wu, Distributed finite-time least squares solver for network linear equations. arXiv preprint arXiv:1810.00156 (2018).
- 57. R. A. Horn, C. R. Johnson, *Matrix Analysis* (Cambridge University Press, New York, NY, 2013).
- 58. D. P. Bertsekas, *Nonlinear Programming* (Athena Scientific, Belmont, 1999).
- 59. Y. Nesterov, *Introductory Lectures on Convex Optimization: A Basic Course, vol. 87* (Springer, New York, 2013).

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.