A distributed gradient algorithm based on randomized block-coordinate and projection-free over networks

The computational bottleneck in distributed optimization methods based on projected gradient descent is the computation of a full gradient vector and the projection step, which is particularly problematic for large datasets. To reduce the computational complexity of existing methods, we combine randomized block-coordinate descent with the Frank-Wolfe technique and propose a distributed randomized block-coordinate projection-free algorithm over networks, in which each agent randomly chooses a subset of the coordinates of its gradient vector and the projection step is eschewed in favor of a much simpler linear optimization step. Moreover, the convergence of the proposed algorithm is theoretically analyzed. Specifically, we rigorously prove that the proposed algorithm converges to an optimal point at a rate of O(1/t) under convexity and O(1/t^2) under strong convexity, where t is the number of iterations. Furthermore, the proposed algorithm converges to a stationary point, where the "Frank-Wolfe" gap is equal to zero, at a rate of O(1/√t) under non-convexity.
To evaluate the computational benefit of the proposed algorithm, we apply it to multiclass classification problems in simulation experiments on two datasets, aloi and news20. The results show that the proposed algorithm is faster than existing distributed optimization algorithms due to its lower computational cost per iteration. Furthermore, the results also show that well-connected graphs or smaller graphs lead to faster convergence, which confirms the theoretical results.


Introduction
In this paper, we focus on constrained optimization problems over networks consisting of multiple agents, where the global objective function is a sum of local functions of all agents. These optimization problems have recently received great attention and have arisen in many applications such as resource allocation [1][2][3], large-scale machine learning [4,5], distributed spectrum sensing in cognitive radio networks [6], estimation in sensor networks [7,8], coordination in multi-agent systems [9,10], and power system control [11,12]. Thus, the design of optimization algorithms to solve such problems is necessary. Besides, we assume that each agent only knows its own objective function and can exchange information with its neighbors over the network. For this reason, efficient distributed optimization algorithms using local communication and local computation are needed.
The seminal work concerning these problems was introduced in [13] (see also [14,15]). Recently, Nedić et al. [16] proposed a distributed subgradient algorithm, which performs a consensus step and a descent step. Duchi et al. [17] proposed a distributed dual averaging method using a similar idea. Moreover, variants of the distributed subgradient algorithm can also be found in [18][19][20][21][22][23][24]. However, the projection step becomes prohibitive when dealing with massive datasets for solving the constrained optimization problem. To reduce the computational bottleneck of the projection step, the Frank-Wolfe algorithm (a.k.a. conditional gradient descent) was proposed in [25]. In the Frank-Wolfe algorithm, the projection step is replaced by a more efficient linear optimization step. Recently, Frank-Wolfe algorithms have received much attention due to their versatility and simplicity [26]. Furthermore, variants of Frank-Wolfe methods can be found in [27][28][29][30]. In addition, a decentralized Frank-Wolfe algorithm over networks was presented in [31]. However, full gradient vectors are employed at each iteration in the above methods.
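To make the contrast with projected methods concrete, a minimal sketch of the classic Frank-Wolfe loop is given below. The quadratic objective, the l1-ball constraint, and all names are illustrative choices of ours, not taken from [25]:

```python
import numpy as np

def frank_wolfe(grad, lmo, x0, T=100):
    """Classic Frank-Wolfe (conditional gradient) loop.

    grad: callable returning the gradient at x.
    lmo:  linear minimization oracle, returns argmin_{v in X} <v, g>.
    """
    x = x0
    for t in range(1, T + 1):
        g = grad(x)
        v = lmo(g)                  # linear step instead of a projection
        gamma = 2.0 / (t + 2)       # standard diminishing step size
        x = (1 - gamma) * x + gamma * v
    return x

# Toy instance: minimize ||x - b||^2 over the l1-ball of radius 1.
# The LMO over the l1-ball returns a signed coordinate vertex.
b = np.array([0.8, -0.3])
lmo = lambda g: -np.sign(g) * (np.arange(g.size) == np.argmax(np.abs(g)))
x = frank_wolfe(lambda x: 2 * (x - b), lmo, np.zeros(2), T=500)
```

Every iterate stays feasible by construction, since it is a convex combination of points of X, which is why no projection is ever needed.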
Despite this progress, the computation of the full gradient vector is a bottleneck for high-dimensional problems. Therefore, variants of Frank-Wolfe methods that use the full gradient vector to update the decision vector can be prohibitive for tackling high-dimensional data. Furthermore, the computation of the linear optimization oracle can also be prohibitive for high-dimensional data in each Frank-Wolfe iteration. For this reason, Lacoste-Julien et al. [32] presented a block-coordinate Frank-Wolfe method. Moreover, variants of this work appeared in [33][34][35] and have been applied in many fields. Despite their success, however, the computation models of these algorithms belong to the centralized framework.
We have recently witnessed the rise of big data, which is high dimensional. Moreover, these data are often scattered across different networked machines. Therefore, distributed variants of block-coordinate Frank-Wolfe algorithms are desirable and necessary for tackling optimization problems of unprecedented dimensionality [36]. We expect such an algorithm to greatly reduce the computational complexity by avoiding expensive operations such as the computation of the full gradient and the projection operation at each iteration. Very recently, Zhang et al. [37] proposed a distributed algorithm for maximizing submodular functions by leveraging randomized block-coordinate and Frank-Wolfe methods, in which each local objective function needs to satisfy the diminishing-returns property. Nonetheless, the objective function may not satisfy this property in some applications. For example, the loss function may be convex in multi-task learning and non-convex in deep learning. However, distributed block-coordinate Frank-Wolfe variants over networks for convex or non-convex functions are barely known, and the design and analysis of such variants has hitherto remained an open problem.
To fill this gap, we propose a novel distributed randomized block-coordinate projection-free algorithm over networks. In the proposed algorithm, each agent randomly chooses a subset of the entries of its gradient vector and moves along the gradient direction at each iteration, and the projection step is replaced by a Frank-Wolfe step. Therefore, the computational burden of solving huge-scale constrained optimization problems is reduced. Furthermore, the proposed algorithm also suits the case in which the structure of the information is incomplete, for instance, when the data are spread among the agents of the network. In addition, the convergence rate of our algorithm is theoretically analyzed for huge-scale constrained convex and non-convex optimization problems, respectively.
The main contributions of this paper are as follows: 1) We propose a distributed randomized block-coordinate projection-free algorithm over networks, where local communication and computation are adopted. The algorithm uses the block-coordinate descent and Frank-Wolfe techniques to reduce the computational cost of the entire gradient vector and the projection step, respectively. 2) We theoretically analyze the convergence rate of our algorithm. 3) We conduct simulation experiments on two datasets to evaluate the performance of our algorithm and confirm the theoretical results. The remainder of the paper is organized as follows: in "Related work", we review some related works. In "Problem formulation, algorithms design, and assumptions", we formulate the optimization problem, present our algorithm, and state the standard assumptions. In "Main results", we describe the main results of the work. In "Convergence analysis", we analyze the convergence properties of the proposed algorithm and prove the main results in detail. The performance of our algorithm is evaluated in "Experiments". The paper is concluded in "Conclusion".

Notation:
We use boldface to denote vectors of suitable dimension and normal font for scalars. We use R to denote the set of real numbers. Moreover, the symbol R^d denotes the set of real vectors of dimension d, and R^{d×d} denotes the set of real matrices of size d × d. The notation ‖·‖ denotes the standard Euclidean norm. The transposes of a vector x and a matrix A are designated as x^T and A^T, respectively. The notation ⟨x, y⟩ denotes the inner product of vectors x and y. The identity matrix of suitable size is designated as I. The all-ones vector is designated as 1. Moreover, the expectation of a random variable X is designated as E[X]. The main notations of this paper are summarized in Table 1.

Related work
Distributed optimization over networks is a challenging problem, where each agent only utilizes its local information. The framework of distributed computation models was developed in the seminal work [13], see also [14,15]. In this framework, the goal is to minimize a common (smooth) function by communication. In contrast, a distributed subgradient descent method was presented in [16], where the objective is to minimize the sum of local functions by local communication and local computation. Its variants were developed in [17][18][19][20][21][22][23]. Furthermore, Chen et al. [24] developed a distributed subgradient algorithm for weakly convex functions. To achieve fast convergence, accelerated distributed gradient descent algorithms were presented in [38][39][40][41][42]. Meanwhile, distributed primal-dual algorithms were also developed in [43]. Moreover, Newton algorithms were developed in [44,45], and quasi-Newton methods were provided in [46]. In addition, decentralized ADMM methods were considered in [47] and [48]. However, the projection is prohibitively expensive for massive datasets. Thus, the Frank-Wolfe method, which was presented in [25], is an efficient method for solving large-scale optimization problems. In the Frank-Wolfe method, the projection step is replaced by a very efficient linear optimization step. The primal-dual convergence rate was analyzed in detail for Frank-Wolfe-type methods in [26]. Furthermore, variants of Frank-Wolfe methods were developed in [26][27][28][29][30]. In addition, Wai et al. [31] proposed a decentralized Frank-Wolfe algorithm over networks. Nevertheless, these methods need to compute the full gradient at each iteration.
For high-dimensional data, however, computing the entire gradient is prohibitive. To reduce the computational burden, coordinate-descent methods were studied in [49], where a subset of the entries of the gradient vector is updated at each iteration. The main difference among coordinate descent algorithms is the criterion for choosing the coordinates of the gradient vector. In these methods, maximal and cyclic coordinate search were often used [49]. Nevertheless, convergence is difficult to prove for cyclic coordinate search [50], and the convergence rate is trivial for maximal coordinate search [49]. In addition, Nesterov studied a random coordinate descent method in [50], where the coordinate is chosen randomly. In [51], the authors extended the method to composite functions. Furthermore, parallel coordinate descent methods were also well studied in [52,53]. In [54], the authors proposed a random block-coordinate gradient projection algorithm. Wang et al. [55] studied coordinate-descent diffusion learning over networks. Notarnicola et al. [56] proposed a blockwise gradient tracking method for distributed optimization. Besides, coordinate primal-dual variants for distributed optimization were also investigated in [57,58].
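A randomized coordinate update of the kind studied in [50] can be sketched as follows. The toy quadratic objective, the matrix Q, and the exact coordinate-minimization step are illustrative assumptions of ours, not taken from the cited works:

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimize f(x) = 0.5 * x^T Q x - b^T x by updating one uniformly random
# coordinate per iteration; only a single entry of the gradient is computed.
Q = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite
b = np.array([1.0, 1.0])
x = np.zeros(2)
for _ in range(2000):
    k = rng.integers(2)                  # pick a random coordinate
    grad_k = Q[k] @ x - b[k]             # k-th entry of the gradient only
    x[k] -= grad_k / Q[k, k]             # exact minimization along coordinate k
x_star = np.linalg.solve(Q, b)           # closed-form minimizer for comparison
```

Each step touches one row of Q, so the per-iteration cost is O(d) rather than the O(d^2) of a full gradient, which is the saving these methods trade against more iterations.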
Further, a block Frank-Wolfe method combining the coordinate descent method and the Frank-Wolfe technique was proposed in [32], and extensions of this work appeared in [33][34][35][37]. To the best of our knowledge, distributed block-coordinate Frank-Wolfe variants over networks for convex or non-convex functions have rarely been investigated. For this reason, this paper focuses on the design and analysis of these variants. The comparison of different algorithms is summarized in Table 2.

Problem formulation, algorithms design, and assumptions
Consider a network modeled by a graph G = (V, E), where V = {1, . . . , n} denotes the set of agents and E ⊂ V × V designates the edge set. The notation (i, j) ∈ E designates an edge over which agent i can send information to agent j, i, j = 1, . . . , n. We use the notation N_i to designate the neighborhood of agent i. The constrained optimization problem of this paper is formulated as follows:

min_{x ∈ X} f(x) := (1/n) Σ_{i=1}^n f_i(x), (1)

where f_i : X → R refers to the cost function of agent i for all i ∈ V, and X ⊆ R^d denotes a constraint set.
Moreover, this paper assumes that the dimensionality d of the vector x is large. To solve problem (1), distributed gradient descent (DGD) methods have been proposed in recent years. For high-dimensional data, however, the computation of the full gradient is expensive and becomes a bottleneck. Furthermore, the projection step is also expensive and may become prohibitive in many computationally intensive applications. To alleviate this computational challenge, we propose a distributed randomized block-coordinate Frank-Wolfe algorithm to solve problem (1) for high-dimensional data.
In this paper, we assume that the communication pattern among the agents is defined by an n-by-n weight matrix A := [a_ij]_{n×n}. Moreover, suppose that A is doubly stochastic, i.e., A1 = 1 and 1^T A = 1^T (Assumption 1). To reduce the computational bottleneck, each agent at each iteration randomly chooses a subset of the entries of its gradient vector. The proposed algorithm is summarized in Algorithm 1. First, each agent i, i = 1, . . . , n, performs a consensus step, i.e.,

z_i(t) = Σ_{j=1}^n a_ij x_j(t). (2)

Second, each agent i performs the following aggregating step:

s_i(t) = Σ_{j=1}^n a_ij s_j(t − 1) + Q_i(t) ∇f_i(z_i(t)) − Q_i(t − 1) ∇f_i(z_i(t − 1)). (3)

Algorithm 1 Distributed randomized block-coordinate projection-free algorithm over networks 1: Input: The number of agents n, the value x_i(0) for all i ∈ {1, . . . , n}, and the matrix A.
2: for t = 0, 1, 2, . . . do 3: for each agent i = 1, . . . , n do 4: Communicate with neighbors: compute z_i(t) by Eq. (2). 5: Compute the aggregating value: compute s_i(t) by Eq. (3). 6: Update the estimated parameter: compute v_i(t) and x_i(t + 1) by Eqs. (5) and (6). 7: end for 8: end for
Here, Q_i(t) ∈ R^{d×d} denotes a diagonal matrix, defined as follows:

[Q_i(t)]_{kk} = q_{i,t}(k) ∈ {0, 1}, k = 1, . . . , d. (4)

Finally, each agent i performs the following Frank-Wolfe step, i.e.,

v_i(t) ∈ argmin_{v ∈ X} ⟨v, s_i(t)⟩ (5)

and

x_i(t + 1) = (1 − γ_t) x_i(t) + γ_t v_i(t), (6)

where γ_t ∈ (0, 1] denotes a step size. Furthermore, we have the initial conditions s_i(0) = ∇f_i(z_i(0)) and Q_i(0) = I_d for all i ∈ V. By the definition of q_{i,t}(k), we know that the k-th entry of the gradient vector is missing when q_{i,t}(k) = 0; thus, the k-th entry of s_i(t) in Eq. (3) is updated without using the gradient information. Furthermore, the update can randomly vary over time and across agents. In addition, we use the more efficient linear optimization step to eschew the projection.
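The loop structure of the algorithm can be sketched in Python as follows. This is an illustrative rendering under simplifying assumptions of ours, not the authors' reference implementation: we pick an l2-ball as the constraint set X, a uniform Bernoulli coordinate mask, and a gradient-tracking update written in the spirit of Eq. (3):

```python
import numpy as np

rng = np.random.default_rng(1)

def lmo(g, radius=1.0):
    # Linear minimization oracle over the l2-ball {v : ||v|| <= radius}:
    # argmin_{v} <v, g> = -radius * g / ||g||.
    nrm = np.linalg.norm(g)
    return np.zeros_like(g) if nrm < 1e-12 else -radius * g / nrm

def block_coordinate_fw(grads, A, d, p, T=3000):
    """Sketch: consensus + masked gradient tracking + Frank-Wolfe step."""
    n = len(grads)
    x = np.zeros((n, d))                 # x_i(0) = 0, assumed feasible
    z = A @ x                            # consensus step, Eq. (2)
    g_prev = np.stack([grads[i](z[i]) for i in range(n)])
    s = g_prev.copy()                    # s_i(0) = grad f_i(z_i(0))
    for t in range(1, T + 1):
        gamma = 2.0 / (t + 1)            # diminishing step size gamma_t
        v = np.stack([lmo(s[i]) for i in range(n)])
        x = (1 - gamma) * x + gamma * v  # Frank-Wolfe update, Eq. (6)
        z = A @ x                        # consensus step, Eq. (2)
        mask = rng.random((n, d)) < p    # random coordinate mask Q_i(t)
        g_new = np.stack([mask[i] * grads[i](z[i]) for i in range(n)])
        s = A @ s + g_new - g_prev       # tracking step in the spirit of Eq. (3)
        g_prev = g_new
    return x.mean(axis=0)

# Two agents with quadratic costs f_i(x) = 0.5 ||x - c_i||^2 (toy data).
c = [np.array([0.6, 0.0]), np.array([0.0, 0.6])]
A = np.full((2, 2), 0.5)                 # doubly stochastic weights
xbar = block_coordinate_fw([lambda x, ci=ci: x - ci for ci in c], A, d=2, p=0.8)
```

With these toy costs the network-average minimizer is the mean of the c_i, and the masked iterates drift toward it while every x_i stays inside the ball by convexity of the update.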
In this paper, each agent sends information to its neighbors over network G. To ensure the dissemination of the information from all agents, we formalize the following assumption, which is a standard assumption in [60].
Assumption 2 Suppose that the network G is strongly connected.
From Assumption 2, we have |λ_2(A)| < 1, where λ_2(·) denotes the second largest eigenvalue (in magnitude) of a matrix. Furthermore, for any x ∈ R^n, from linear algebra we obtain

‖Ax − x̄1‖ ≤ |λ_2(A)| ‖x − x̄1‖, (7)

where x̄ = (1/n) 1^T x. From Eq. (7), we can see that the average x̄ is computed at a linear rate by average consensus. Next, we introduce the smallest integer t_{0,θ} such that the condition in Eq. (8) is satisfied; following from Eq. (8), we obtain Eq. (9). Besides, the following assumptions are also provided.
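A small numeric illustration of this linear contraction, using a hypothetical 3-agent doubly stochastic weight matrix of our own choosing:

```python
import numpy as np

# A doubly stochastic weight matrix for a hypothetical 3-agent network.
A = np.array([[0.5, 0.25, 0.25],
              [0.25, 0.5, 0.25],
              [0.25, 0.25, 0.5]])
x = np.array([3.0, -1.0, 4.0])   # initial agent values
xbar = x.mean()                  # the consensus value (1/n) 1^T x
lam2 = np.sort(np.abs(np.linalg.eigvals(A)))[-2]

# One consensus step preserves the mean and contracts the disagreement
# by a factor of at most |lambda_2(A)|.
x1 = A @ x
err0 = np.linalg.norm(x - xbar)
err1 = np.linalg.norm(x1 - xbar)
```

Here err1 ≤ lam2 · err0, so repeated averaging drives every agent's value to x̄ at a linear (geometric) rate, exactly as Eq. (7) states.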

Assumption 3
The set X is bounded and convex. Moreover, the optimal set X * is nonempty.
Moreover, we define the diameter of X as follows: D := sup_{x, y ∈ X} ‖x − y‖.

Assumption 4
For any x, y ∈ X and i ∈ V, there exist positive constants β and L such that

‖∇f_i(x) − ∇f_i(y)‖ ≤ β ‖x − y‖ (10)

and

|f_i(x) − f_i(y)| ≤ L ‖x − y‖. (11)

Then, f_i is said to be β-smooth and L-Lipschitz.
From the Lipschitz condition, we have ‖∇f_i(x)‖ ≤ L for any x ∈ X. Furthermore, the relation (10) is equivalent to

f_i(y) ≤ f_i(x) + ⟨∇f_i(x), y − x⟩ + (β/2) ‖y − x‖^2.

In addition, a function f_i is μ-strongly convex, with μ > 0, if it satisfies

f_i(y) ≥ f_i(x) + ⟨∇f_i(x), y − x⟩ + (μ/2) ‖y − x‖^2

for any x, y ∈ X. Moreover, by the definition of the function f, we then know that f is μ-strongly convex. Besides, we also introduce the following parameter:

α := inf_{x ∈ B_X} ‖x − x*‖, (12)

where B_X designates the boundary set of X. From Eq. (12), the solution x* belongs to the interior of X if α > 0. Let F_t denote the filtration generated by {x_i(t)} produced by our algorithm described in Eqs. (2)-(6) up to time t at all agents. Assumption 5 is adopted on the random variables q_{i,t}(k).

Assumption 5
The random variables q_{i,t}(k) and q_{j,t}(l) are mutually independent for all i, j, k, l. Furthermore, the random variables q_{i,t}(k) are identically distributed over time and coordinates, with P(q_{i,t}(k) = 1) = p_i for each agent i ∈ V.

Main results
To find the optimal solution of problem (1), the optimal set is defined as

X* := {x ∈ X | f(x) = f*},

where f* := min_{x∈X} f(x). Besides, we introduce the network average

x(t) := (1/n) Σ_{i=1}^n x_i(t).

The first result shows the rate of convergence for convex cost functions.

Theorem 1 Let Assumptions 1-5 hold and let γ_t = 2/t. (a) Suppose that all cost functions f_i are convex. Then, for t ≥ 2, we have

E[f(x(t))] − f* ≤ κ/t,

where κ > 0 is a constant. (b) Furthermore, assume that α > 0 and all cost functions f_i are μ-strongly convex. Then, for t ≥ 2, we have

E[f(x(t))] − f* = O(1/t^2),

where the hidden constant depends on κ, μ, α, and a constant ζ that is greater than 1.
The detailed proof is provided in "Convergence analysis". By Theorem 1, we can see that the rate of convergence is O(1/t) when the cost functions f i are convex. Furthermore, the rate of convergence is O(1/t 2 ) under strong convexity.
When each function f_i is possibly non-convex, we derive the convergence rate to stationarity. To this end, we first introduce the "Frank-Wolfe" gap of f at x(t):

Δ(t) := max_{v ∈ X} ⟨∇f(x(t)), x(t) − v⟩. (15)

From Eq. (15), we have Δ(t) ≥ 0. Moreover, x(t) is a stationary point of problem (1) when Δ(t) = 0. The second result shows the rate of convergence for non-convex cost functions.
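The gap in Eq. (15) is computable with a single call to the same linear minimization oracle the algorithm already uses. A minimal sketch, where the unit l2-ball constraint and the quadratic objective are illustrative assumptions of ours:

```python
import numpy as np

def lmo(g):
    # argmin_{v in unit l2-ball} <v, g>; returns 0 when the gradient vanishes.
    nrm = np.linalg.norm(g)
    return np.zeros_like(g) if nrm < 1e-12 else -g / nrm

def fw_gap(grad_at_x, x):
    # Frank-Wolfe gap: max_{v in X} <grad f(x), x - v> = <g, x - lmo(g)>.
    v = lmo(grad_at_x)
    return float(grad_at_x @ (x - v))

# f(x) = 0.5 ||x - b||^2 with b strictly inside the unit ball.
b = np.array([0.5, 0.0])
gap_at_zero = fw_gap(0.0 - b, np.zeros(2))   # grad f(0) = -b, gap is positive
gap_at_opt = fw_gap(b - b, b)                # grad f(b) = 0, so the gap is 0
```

The gap is zero exactly at stationary points, which is why it serves as the stationarity measure in Theorem 2.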
Theorem 2 Let Assumptions 1-5 hold. Suppose that each function f_i is possibly non-convex and T is an even number. Moreover, let γ_t = 1/t^θ with 0 < θ < 1. Then, for all T ≥ 6 and t ≥ t_{0,θ}: if θ ∈ [1/2, 1), we have

min_{t ∈ {T/2+1, ..., T}} E[Δ(t)] = O(1/T^{1−θ}),

where the hidden constant involves p_max = max_{i∈V} p_i, p_min = min_{i∈V} p_i, and C_2 = t_{0,θ}^θ 2n √(p_max) β (D + 2C_1). If θ ∈ (0, 1/2), we have

min_{t ∈ {T/2+1, ..., T}} E[Δ(t)] = O(1/T^θ).

The detailed proof is provided in "Convergence analysis". By Theorem 2, if the cost functions f_i are potentially non-convex, we can see that the quickest rate of convergence, attained at θ = 1/2, is O(1/√T).

Convergence analysis
In this section, we analyze the convergence rate. To this end, we first define some auxiliary variables and establish some equalities, stated as Lemma 1. Proof (a) By Eq. (18), using Eq. (3) and the double stochasticity of A, we obtain Eq. (21). Recursively applying Eq. (21) and using the initial conditions s_i(0) = ∇f_i(z_i(0)) and Q_i(0) = I_d, we obtain the claimed identity. Therefore, part (a) is proved. (b) Using the double stochasticity of A, the claimed identity follows directly. This finishes the proof of part (b).
We now derive some important results, which are used in the convergence analysis.
Proof We first have the relation (24), where we have used the property of the Euclidean norm to obtain the last inequality. Moreover, if the inequality (25) holds, where C_1 = t_{0,θ} D √n, then the result follows from Eq. (24). Therefore, we next prove that Eq. (25) holds, by induction. Because the constraint set X is convex, we have z_i(t), x(t) ∈ X. Moreover, the set X is bounded with diameter D; thus, Eq. (25) holds for t = 1 to t = t_{0,θ}. Further, suppose Eq. (25) holds for some t ≥ t_{0,θ}. The induction step then proceeds by a chain of bounds: the first uses Eq. (7); the next follows from the Cauchy-Schwarz inequality, the boundedness of X, and the inequality (28). Using Eq. (28), we get Eq. (29). Therefore, the induction step is finished and the result is proved completely.
Proof From the property of the norm, we first have the inequality (30), where the first inequality is obtained using a norm inequality that holds for any vector w ∈ R^d, and the properties of the norm give the last inequality. Therefore, if the inequality (31) holds, then the result of this lemma follows from Eq. (30). To prove Eq. (31), a variable is defined in Eq. (32). Plugging Eq. (32) into Eq. (3) yields Eq. (33). We employ induction to prove Eq. (31). Following from Lemma 2 and the boundedness of the gradients, Eq. (31) holds for t = 1 to t = t_{0,θ}. Then, we assume that Eq. (31) holds up to some t ≥ t_{0,θ}. According to the definition of S_i(t) and Eq. (33), we obtain a bound whose first inequality uses the conclusion of part (a) in Lemma 1 and Eq. (7). Furthermore, we introduce an auxiliary variable so that the term involving S_i(t) and g(t + 1) can be bounded using the Cauchy-Schwarz inequality. The resulting deviation term is bounded using ‖a − b‖^2 ≤ 2(‖a‖^2 + ‖b‖^2) for a, b ∈ R^d. In addition, by using the smoothness of f_i and Eq. (32), we obtain a further bound, where the first inequality is derived from the definition of the matrix Q_i(t) and the smoothness of f_i; the second inequality uses (Σ_{i=1}^n a_i)^2 ≤ n Σ_{i=1}^n a_i^2 and the fact that a_ij^2 ≤ a_ij since 0 ≤ a_ij ≤ 1 for all i, j ∈ V; the third inequality is deduced using Eq. (6) and the triangle inequality; the fourth inequality follows from Lemma 2 and the boundedness of X; the last equality follows from Assumption 1.
Taking conditional expectation on both sides of Eq. (36) and then applying Eq. (37), we obtain Eq. (38), where p_max = max_{i∈V} p_i. Taking conditional expectation on both sides of Eq. (35) and then using Eq. (38), the Cauchy-Schwarz inequality, and the definition of C_2, we obtain Eq. (39), in which the inequality (40) is used to derive the third inequality. Taking conditional expectation on both sides of Eq. (34) and then using Eq. (39), we deduce Eq. (41). Furthermore, by Eq. (8), we have Eq. (42) for t ≥ t_{0,θ}, which implies Eq. (43). This completes the induction step. Thus, Lemma 3 is proved completely. We now prove Theorem 1 using Lemmata 1-3.

Proof of Theorem 1.
Since each function f_i is β-smooth, the function f is also β-smooth. Thus, using Lemma 1 and the boundedness of X, we obtain Eq. (44). Furthermore, for i = 1, . . . , n and v ∈ X, we also obtain Eq. (45), where the first equality holds by adding and subtracting S_i(t); the first inequality holds because v_i(t) ∈ argmin_{v∈X} ⟨v, S_i(t)⟩; the last inequality is derived by adding and subtracting (1/n) Σ_{i=1}^n Q_i(t) ∇f_i(x(t)) and using the fact that X is bounded. Taking expectation with respect to the random variables Q_i(t) in Eq. (45) and using Assumption 5, we obtain Eq. (46). To estimate Eq. (46), we need to bound the remaining gradient-deviation term.
By adding and subtracting g(t) and using the triangle inequality, we get Eq. (47). Using Eq. (19) yields Eq. (48), where the second inequality is obtained since all functions f_i are β-smooth, and the last inequality is due to Lemma 2. Combining Eqs. (29), (46), (47), and (48) yields Eq. (49). Moreover, letting v = ṽ(t) ∈ argmin_{v∈X} ⟨∇f(x(t)), v⟩ in Eq. (49) and using p_i = 1/2 for all i ∈ V, we further obtain Eq. (50). Taking conditional expectation with respect to F_t and using Eq. (49), we deduce Eq. (51). Subtracting f(x*) from both sides of Eq. (51) yields Eq. (52), where the last inequality is derived using the convexity of f; we thus deduce Eq. (53). Letting h(t) := f(x(t)) − f(x*) and substituting Eq. (53) into Eq. (52) implies Eq. (54). The claimed bound then follows by induction, in which the relation 1/t − 1/(t + 1) ≤ 1/t^2 yields the second inequality and the last inequality is due to the definition of κ. The induction step is completed, and part (a) of Theorem 1 is proved.
For any t > t', Eq. (61) is less than or equal to 0. Therefore, we obtain Eq. (62). Furthermore, letting ζ = 2 and using the result of part (a) in Theorem 1, we obtain the claimed bound. In addition, the inequality E[h(t) | F_{t−1}] ≤ κ/t holds for all t ≥ 2. Therefore, we obtain part (b).
We next prove Theorem 2.
Proof of Theorem 2. Using Eq. (15), the fact that Δ(t) ≥ 0, and the β-smoothness of f, we obtain Eq. (66). Using Lemma 1 yields Eq. (67). Employing the triangle inequality implies Eq. (68), where the last inequality follows from the boundedness of X. Thus, Eq. (67) is bounded as in Eq. (70). Using Eq. (70) implies Eq. (71). Summing both sides of Eq. (71) over t, we obtain Eq. (72). In addition, since γ_t ≥ 0 and Δ(t) ≥ 0, we have Eq. (73). Using the expression γ_t = t^{−θ}, for T ≥ 6 and θ ∈ (0, 1), we can bound the sum Σ_{t=T/2+1}^{T} γ_t. When θ ≥ 1/2, we deduce Eq. (75). Plugging Eq. (75) into Eq. (72) implies Eq. (76). When θ < 1/2, we also have Eq. (78). Plugging Eq. (78) into Eq. (72) and using the Lipschitz condition of f, we deduce Eq. (79), where the last inequality is due to T^{2θ−1} ≤ 1 for all θ < 1/2. Therefore, the results of Theorem 2 are obtained.

Experiments
The proposed algorithm is used to solve a multiclass classification problem with different loss functions and a structural SVM problem to evaluate the performance of our algorithm. The experiments are run on a Windows 10 machine equipped with a 1080Ti GPU and 64 GB of memory. Moreover, the experiment programs are implemented in MATLAB 2018a.

Multiclass classification
We first introduce the multiclass classification problem. The notation S = {1, . . . , ℓ} designates the set of classes. Each agent i ∈ {1, . . . , n} has access to a data example d_i(t) ∈ R^d belonging to a class in S, and needs to obtain a decision matrix X_i(t) = [x_1; . . . ; x_ℓ] ∈ R^{ℓ×d}. Furthermore, the class label is predicted by argmax_{h∈S} x_h^T d_i(t). The local loss function of each agent i is defined with respect to the true class label y_i(t). Moreover, the constraint set is X = {X ∈ R^{ℓ×d} | ‖X‖_* ≤ δ}, where ‖·‖_* is the Frobenius norm of a matrix and δ is a positive constant.
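The prediction rule and the linear minimization oracle for the ball constraint above can be sketched as follows (taking ‖·‖_* to be the Frobenius norm as stated; the concrete matrices and numbers are illustrative choices of ours):

```python
import numpy as np

def predict(X_mat, d_feat):
    # Predicted class: argmax_h <x_h, d_i(t)> over the rows of X_i(t).
    return int(np.argmax(X_mat @ d_feat))

def lmo_ball(G, delta):
    # Linear minimization oracle over {X : ||X||_F <= delta}:
    # argmin <V, G> = -delta * G / ||G||_F.
    nrm = np.linalg.norm(G)
    return np.zeros_like(G) if nrm < 1e-12 else -delta * G / nrm

X_mat = np.array([[1.0, 0.0],
                  [0.0, 1.0]])                # 2 classes, 2 features (toy)
label = predict(X_mat, np.array([2.0, 1.0]))  # scores (2, 1) -> class 0
V = lmo_ball(np.ones((2, 2)), delta=2.0)      # lies on the radius-2 sphere
```

For a Frobenius-norm ball the oracle is a closed-form rescaling of the gradient, so each Frank-Wolfe step costs no more than reading the gradient itself.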
In our experiments, we employ two relatively large multiclass datasets from LIBSVM Data to test the performance of the designed algorithm. Table 3 presents a summary of these datasets. Besides, the parameters are set as the theory suggests; in particular, the step size is set to 2/t in these experiments.

Experimental results
To demonstrate the performance advantage of our algorithm, we first compare the proposed algorithm with DeFW [31], EXTRA [39], and DGD [40] on different datasets with n = 64. As depicted in Fig. 1, the convergence speed of our algorithm is faster than that of DeFW, EXTRA, and DGD on the two datasets news20 and aloi. Because the per-iteration computational cost of the proposed algorithm is lower than that of DeFW, EXTRA, and DGD, our algorithm completes more iterations within the same running time, and the convergence speed is accelerated correspondingly.
To investigate the impact of the number of nodes on the performance of our algorithm, we run the proposed algorithm on complete graphs with different numbers of nodes. As depicted in Fig. 2, a larger graph leads to a slower convergence rate. Furthermore, the convergence performance of our algorithm is comparable to that of the centralized gradient descent algorithm.
To evaluate the impact of network topologies on the performance of our algorithm, we run the proposed algorithm on a complete graph, a random graph (Watts-Strogatz), and a cycle graph, respectively. Moreover, the number of nodes in these graphs is set to n = 64. The results are depicted in Fig. 3. We find that the complete graph leads to slightly faster convergence than the random graph and the cycle graph. In other words, better connectivity leads to a faster convergence rate for the proposed algorithm.

Conclusion
This paper has presented a distributed randomized block-coordinate Frank-Wolfe algorithm over networks for solving high-dimensional constrained optimization problems. Furthermore, a detailed analysis of the convergence rate of the proposed algorithm has been provided. Specifically, using a diminishing step size, our algorithm converges at a rate of O(1/t) for convex objective functions; for strongly convex objective functions, the convergence rate is O(1/t^2) when the optimal solution is an interior point of the constraint set. Moreover, our algorithm converges to a stationary point at a rate of O(1/√t) under non-convexity by employing a diminishing step size. Finally, the theoretical results have been confirmed by experiments, which show that our algorithm is faster than existing distributed algorithms. In future work, we will devise and analyze distributed adaptive block-coordinate Frank-Wolfe algorithms with momentum for rapidly training distributed deep neural networks.

Data availability
The data that support the findings of this study are news20 and aloi, which are available from http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Conflict of interest
The authors declared no potential conflicts of interest with respect to the research, authorship, and publication of this article.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.