1 Introduction

In recent years, distributed optimization has attracted a surge of interest in diverse areas, including autonomous vehicle control [16], multi-agent systems [31], and sensor networks [1], owing to its advantages in data privacy, robustness, flexibility, and scalability. Distributed optimization minimizes a joint objective through local computation and communication between agents in a network. Recently, much effort has been devoted to the distributed stochastic setting [11, 19, 29, 30], where each agent’s objective function is the expectation of a function of random variables that follow unknown distributions. Such problems arise widely in machine learning [5, 19], multi-agent reinforcement learning [25, 27, 28], and unmanned systems [7, 31], to name a few. Most distributed algorithms for solving such problems require explicit gradients of the objective functions. In many practical applications, however, the feedback available to agents is incomplete or noisy because of environmental uncertainty, so assuming exact gradient feedback is often too restrictive in practice.

Zeroth-order optimization is a typical gradient-free approach that has attracted widespread attention due to its use in many practical large-scale optimization tasks. In these tasks, the explicit gradient of the objective function is expensive or impossible to obtain, and only function evaluations are accessible. For instance, the objective functions of many big-data problems with complex data-generation processes cannot be defined explicitly. Such situations include large-scale black-box adversarial attacks on deep networks [8], simulation-based modeling [20], and reinforcement learning [24], among others. Motivated by these applications, the design and analysis of zeroth-order algorithms have become increasingly popular, including distributed zeroth-order algorithms [21, 32, 34, 35] and stochastic zeroth-order algorithms [33, 36]. Nevertheless, most zeroth-order algorithms, even in centralized settings, are designed for unconstrained optimization problems or depend on projection operators to handle constraint sets. Projection operations may incur an undesirable computational burden and can even become computationally prohibitive for structured constraints such as latent group Lasso balls, \(l_{1}\) norm balls, and nuclear norm balls [15].

Consequently, the Frank-Wolfe (FW) method [10], also known as the conditional gradient method, has resurged because of its projection-free and computationally efficient nature. The FW method avoids the projection step by accessing a linear minimization (LM) oracle, which can be implemented efficiently, especially for some widespread structured constraints (see Table I in [15]). For instance, solving an LM problem over a nuclear norm ball only requires computing a single pair of singular vectors corresponding to the largest singular value, whereas projecting a point onto a nuclear norm ball demands a full SVD. Recent years have witnessed extensive research on FW algorithms in both the centralized stochastic setting [2, 12, 18] and the distributed deterministic setting [5, 6, 17]. Note that the aforementioned FW algorithms are all built on first-order gradients and therefore cannot be applied directly to problems where only objective function values are accessible.

The FW method with a stochastic zeroth-order oracle (SZO) has recently been investigated in both convex and nonconvex settings. Specifically, [4] put forth zeroth-order stochastic FW algorithms with complexity bounds \(\mathcal{O}(n/\epsilon ^{2})\) and \(\mathcal{O}(n/\epsilon ^{4})\) on the SZO for convex and nonconvex cases, respectively. However, the algorithms in [4] require a mini-batch size tied to the total number of iterations and the problem dimension to guarantee convergence. Further, [14] relaxed the conditions on batch sizes via the variance-reduction technique SPIDER, and showed that the resulting algorithm achieves a lower complexity bound of \(\mathcal{O}(n/\epsilon ^{3})\) on the SZO in the nonconvex setting. For the convex case, [3] put forth a stochastic zeroth-order FW method that requires only a single batch per iteration thanks to a momentum-based gradient tracking technique, and obtained a complexity bound of \(\mathcal{O}(n/\epsilon ^{2})\) on the SZO. Subsequently, [22] extended centralized stochastic zeroth-order FW methods to a decentralized setting that still depends on a central coordinator, and showed that the proposed algorithm has complexity bounds \(\mathcal{O}(n/\epsilon ^{3})\) and \(\mathcal{O}(n^{\frac{4}{3}}/\epsilon ^{4})\) on the SZO for convex and nonconvex cases, respectively. To date, however, there are no efficient zeroth-order FW methods for solving distributed stochastic optimization (DSO) problems in either convex or nonconvex settings.

Motivated by the above discussion, this paper is dedicated to designing a novel distributed projection-free and gradient-free algorithm for DSO problems. We provide a rigorous theoretical analysis of the convergence rate and complexity guarantees of the proposed algorithm, which enjoys a convergence rate comparable to that of centralized stochastic first-order optimization algorithms [13], filling a theoretical gap for zeroth-order FW methods in DSO problems. Table 1 compares the proposed algorithm with related methods. The main contributions of our work are as follows.

  • We put forth a Distributed Stochastic Zeroth-Order Frank-Wolfe algorithm (DSZO-FW) that combines the gradient tracking technique, the momentum-based variance reduction technique, and coordinate-wise gradient estimation. To the best of our knowledge, DSZO-FW is the first zeroth-order FW algorithm for DSO problems.

  • We derive sufficient conditions to guarantee the convergence of DSZO-FW under mild assumptions. Specifically, DSZO-FW converges using only a single batch per iteration by introducing the recursive momentum technique [9]. We establish convergence rates of \(\mathcal{O}(k^{-\frac{1}{2}})\) and \(\mathcal{O}(1/\log _{2}(k))\) for the convex and nonconvex cases, respectively. The guarantee in the convex case matches the previous best-known result for centralized stochastic optimization methods.

  • For convex objective functions, we prove that DSZO-FW has a function query complexity of \(\mathcal{O}(n/\epsilon ^{2})\) for finding an ϵ-optimal solution, which matches the best existing centralized results [3, 4] and is even smaller than that of the recent decentralized FW method in [22].

  • For nonconvex objective functions, we show that DSZO-FW has a function query complexity of \(\mathcal{O}(n\cdot 2^{\frac{1}{\epsilon}})\) for finding an ϵ-stationary point under time-decaying step sizes. In contrast, other works [4, 14, 22] for such problems rely on step sizes that depend on the total number of iterations.

Table 1 Complexity bounds for Stochastic Frank-Wolfe Optimization method to find an ϵ-optimal or ϵ-stationary point

The remainder of this paper is structured as follows. We introduce the problem and the algorithm design in Sect. 2. The convergence performance and theoretical guarantees of the proposed algorithm are presented in Sect. 3. Section 4 presents several simulation experiments that validate the efficacy of the algorithm. Section 5 concludes the work. The Appendix provides the technical proofs of the paper.

Notations

The notations used in this paper are fairly standard. Specifically, we denote \(\mathbb{R}\) as the set of real numbers, and \(\mathbb{R}_{+}\) as the set of nonnegative real numbers. Symbols \(\langle \cdot \rangle \) and \(\lceil \cdot \rceil \) denote the inner product and the ceiling operation, respectively. In addition, \(\mathbb{R}^{p}\) is the set of p-dimensional real vectors. Consider a vector \(v\in \mathbb{R}^{p}\). We write \(\|v\|_{q}\) for the \(l_{q}\) norm of v and \(\|v\|\) for the Euclidean norm of v. We write \(\mathbb{E}[\cdot ]\) to denote the expectation operator; moreover, \(\mathbb{E}[\cdot |\mathcal{F}_{k}]\) represents the conditional expectation on the σ-field \(\mathcal{F}_{k}\). Finally, \(W=[w_{ij}]_{N\times N}\) is the weighted adjacency matrix of a topology graph \(\mathcal{G}(\mathcal{N},\mathcal{E})\), where \(\mathcal{N}=\{1,2,\ldots ,N\}\) is the set of N agents, and \(\mathcal{E}\subseteq \mathcal{N}\times \mathcal{N}\) is the set of edges. For any \(i,j\in \mathcal{N}\), if \((i,j)\in \mathcal{E}\), then \(w_{ij}>0\); otherwise \(w_{ij}=0\).

2 Problem statement and algorithm design

2.1 Problem statement

Consider a set of agents \(\mathcal{N}=\{1,2,\ldots ,N\}\) over an undirected network \(\mathcal{G}=\{\mathcal{N},\mathcal{E}\}\), where \(\mathcal{E}\subseteq \mathcal{N}\times \mathcal{N}\) is a set of edges. These agents aim to collaborate to find an optimal solution \(x^{*}\) of the problem

$$\begin{aligned} &\min_{ x\in \mathcal{X}}h(x), h(x):= \frac{1}{N}\sum _{i=1}^{N}\mathbb{E}_{\xi ^{i}} \bigl[h_{i} \bigl(x,\xi ^{i} \bigr) \bigr], \end{aligned}$$
(1)

where \(x\in \mathbb{R}^{n}\) is the strategy variable, and \(\mathcal{X}\subseteq \mathbb{R}^{n}\) is a compact and convex set. The function \(H_{i}(x):=\mathbb{E}_{\xi ^{i}}[h_{i}(x,\xi ^{i})]\) is a local objective function, and \(h_{i}:\mathcal{X}\times \mathbb{R}^{p}\rightarrow \mathbb{R}\) is a function involving a random variable \(\xi ^{i}\) with an unknown distribution. The randomness \(\xi ^{i}\) can be viewed as a random sample injected by the algorithm or as measurement noise inherent in the system. Here, we assume that the gradient of the objective function \(H_{i}(\cdot )\) is expensive or infeasible to obtain, and that agent \(i\in \mathcal{N}\) can only access a stochastic approximation \(h_{i}(x,\xi ^{i})\) of the true objective value for any given x and \(\xi ^{i}\).

2.2 Algorithm design

We propose a Distributed Stochastic Zeroth-Order Frank-Wolfe algorithm (DSZO-FW), which is summarized in Algorithm 1. To measure the convergence performance of DSZO-FW, we introduce the following two oracles and a performance measure.

  • Stochastic Zeroth-order Oracle (SZO): SZO returns a function value \(h_{i}(x,\xi ^{i})\) for given \(x\in \mathbb{R}^{n}\) and \(\xi ^{i}\in \mathbb{R}^{p}\).

  • Linear Minimization Oracle (LMO): LMO solves a linear optimization problem, and returns \(\operatorname{argmin}_{\phi \in \mathcal{X}}\langle s,\phi \rangle \) for given direction s and constraint set \(\mathcal{X}\).

  • ϵ-optimal solution: Let \(x^{*}\in \mathcal{X}\) be an optimal solution of problem (1). If \(h(x)-h(x^{*})\leqslant \epsilon \), then \(x\in \mathcal {X}\) is an ϵ-optimal solution of problem (1).

Algorithm 1
figure a

DSZO-FW

Due to the unavailability of the gradient information for objective functions, agent i estimates the gradient \(\nabla h_{i}(x^{i},\xi ^{i})\) by using a coordinate-wise gradient estimator [3, 14]:

$$\begin{aligned} \hat{\nabla}h_{i} \bigl(x^{i},\xi ^{i} \bigr) = \sum_{j=1}^{n} \frac{h_{i}(x^{i}+\rho e_{j},\xi ^{i})-h_{i}(x^{i}-\rho e_{j},\xi ^{i})}{2\rho}e_{j}, \end{aligned}$$
(2)

where \(\rho >0\) denotes the element-wise smoothing parameter, and \(e_{j}\in \mathbb{R}^{n}\) is the jth standard basis vector, i.e., \([e_{j}]_{l}=1\) if \(l=j\) and \([e_{j}]_{l}=0\) otherwise. At iteration k of Algorithm 1, the estimator (2) takes the following form:

$$\begin{aligned} &\hat{\nabla}h_{i} \bigl(x^{i}_{k}, \xi ^{i}_{k} \bigr) \\ &\quad = \sum_{j=1}^{n} \frac{h_{i}(x^{i}_{k} + \rho _{k} e_{j},\xi ^{i}_{k}) - h_{i}(x^{i}_{k} - \rho _{k} e_{j},\xi ^{i}_{k})}{2\rho _{k}}e_{j}, \end{aligned}$$
(3)

where \(\{\rho _{k}\}_{k=1}^{\infty}\) is a decreasing sequence of positive real numbers.
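As a concrete illustration, the estimator (2) can be sketched in a few lines of Python (the function name is ours, and the callable `h` is a stand-in for a generic sampled objective \(h_{i}(\cdot ,\xi ^{i})\)); for a quadratic objective the central difference recovers the gradient exactly:

```python
import numpy as np

def coord_grad_est(h, x, rho, xi):
    """Coordinate-wise central-difference estimator, cf. (2):
    sum_j [h(x + rho*e_j, xi) - h(x - rho*e_j, xi)] / (2*rho) * e_j.
    Each call costs 2n function queries."""
    n = x.size
    g = np.zeros(n)
    for j in range(n):
        e = np.zeros(n)
        e[j] = rho
        g[j] = (h(x + e, xi) - h(x - e, xi)) / (2.0 * rho)
    return g

# Sanity check on a smooth quadratic whose gradient is x + xi:
h = lambda x, xi: 0.5 * x @ x + xi @ x
x = np.array([1.0, -2.0, 3.0])
xi = np.array([0.1, 0.2, -0.3])
est = coord_grad_est(h, x, rho=1e-4, xi=xi)
```

For an L-smooth objective, the estimation error is of order \(nL^{2}\rho ^{2}\) (see Fact 3 below), which motivates the decreasing sequence \(\rho _{k}\).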

In Algorithm 1, each agent uses the SZO rather than gradient information and executes four main steps. Here, we briefly describe the kth iteration of agent i.

  • Step 1: Agent i takes a weighted average of values from its neighbors on the basis of W, and uses \(\bar{x}^{i}_{k}\) to approximate the average iterate. The specific description is provided in (2).

  • Step 2: Agent i estimates the gradient by using the coordinate-wise gradient estimator (3). To address the non-vanishing variance caused by the gradient estimation, we introduce a modified momentum-based variance reduction method, aka recursive momentum [9], into the distributed stochastic Frank-Wolfe (FW) update; see Remark 2 for the explicit expression.

  • Step 3: Agent i approximates the global gradient by using the gradient tracking technique, which reuses the global gradient estimation \(y^{i}_{k-1}\) from the previous iteration via (4) and (5).

  • Step 4: To avoid projection operations, agent i updates its iterate by first solving the linear minimization problem (6) to obtain a conditional gradient \(z^{i}_{k}\), and then taking a convex combination with the average iterate approximation \(\bar{x}^{i}_{k}\) in (7).
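The four steps above can be sketched in code. The following is a minimal, illustrative Python simulation of one possible realization of DSZO-FW on a toy quadratic problem over an \(l_{1}\) ball. The function names, the toy objective, and the particular form of the gradient-tracking line are our own assumptions; Algorithm 1's exact updates (2)–(7) appear in the algorithm figure and may differ in detail.

```python
import numpy as np

def lmo_l1(s, d):
    # Step 4 helper: argmin of <s, z> over {||z||_1 <= d} is a ball vertex
    h = int(np.argmax(np.abs(s)))
    z = np.zeros_like(s)
    z[h] = -d * np.sign(s[h])
    return z

def grad_est(h_i, x, xi, rho):
    # Coordinate-wise central-difference estimator, cf. (2)-(3)
    g = np.zeros_like(x)
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = rho
        g[j] = (h_i(x + e, xi) - h_i(x - e, xi)) / (2.0 * rho)
    return g

def dszo_fw(h_list, W, x0, d, K, rng):
    N, n = len(h_list), x0.size
    x = np.tile(x0, (N, 1))                 # row i holds agent i's iterate
    g = np.zeros((N, n))                    # recursive-momentum gradient estimates
    s = np.zeros((N, n))                    # gradient-tracking variables
    g_prev = np.zeros((N, n))
    xbar_prev = x.copy()
    for k in range(1, K + 1):
        beta, gamma = 2.0 / (k + 1), 2.0 / (k + 2)
        rho = d / (np.sqrt(n) * (k + 2))
        xbar = W @ x                        # Step 1: consensus averaging
        for i in range(N):
            xi = rng.standard_normal()      # one fresh sample per agent and iteration
            gh = grad_est(h_list[i], xbar[i], xi, rho)
            gh_prev = grad_est(h_list[i], xbar_prev[i], xi, rho)
            g[i] = beta * gh + (1.0 - beta) * (gh - gh_prev + g_prev[i])  # Step 2
        s = W @ s + g - g_prev              # Step 3: one common gradient-tracking form
        for i in range(N):
            z = lmo_l1(s[i], d)             # Step 4: LM oracle + convex combination
            x[i] = (1.0 - gamma) * xbar[i] + gamma * z
        xbar_prev, g_prev = xbar.copy(), g.copy()
    return x

# Toy instance: N = 4 agents, quadratic local losses with mean-zero noise.
rng = np.random.default_rng(0)
N, n, d = 4, 3, 5.0
targets = np.array([[1.0, 0.5, -0.5], [0.8, 0.2, 0.1],
                    [1.2, 0.4, -0.2], [0.9, 0.6, 0.0]])
h_list = [lambda x, xi, c=targets[i]: 0.5 * np.sum((x - c) ** 2) + 0.01 * xi * np.sum(x)
          for i in range(N)]
W = np.zeros((N, N))                        # ring with maximum-degree weights
for i in range(N):
    W[i, i] = W[i, (i + 1) % N] = W[i, (i - 1) % N] = 1.0 / 3.0
x_final = dszo_fw(h_list, W, np.zeros(n), d, 40, rng)
```

By construction, every iterate stays inside the \(l_{1}\) ball, since each update is a convex combination of a feasible point and a ball vertex.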

Remark 1

The use of zeroth-order gradients, also known as derivative-free optimization, brings both unique challenges and potential advantages. One main challenge of zeroth-order methods is that they require many more function evaluations than first-order methods, leading to higher gradient variance and computational cost. To address this issue, this paper incorporates the recursive momentum technique into a gradient-tracking distributed framework to reduce the non-vanishing variance caused by gradient estimation. Remarkably, the proposed distributed zeroth-order algorithm not only attenuates the noise in the gradient approximation using only a single batch, but also achieves a function query complexity comparable to the best existing centralized result in the convex case. The most significant advantage of zeroth-order gradients is the ability to optimize functions without gradient information, which makes the approach applicable to a wider range of problems where gradients are difficult or impossible to compute.

Remark 2

In Algorithm 1, we introduce the recursive momentum technique into the distributed zeroth-order FW method for reducing the variance caused by gradient estimates, as described in (3). Specifically, we rewrite (3) as

$$\begin{aligned} g^{i}_{k}={}&\beta _{k}\hat{\nabla}h_{i} \bigl(\bar{x}^{i}_{k},\xi ^{i}_{k} \bigr)+(1- \beta _{k}) \bigl(\hat{\nabla}h_{i} \bigl(\bar{x}^{i}_{k},\xi ^{i}_{k} \bigr) \\ &{}-\hat{\nabla}h_{i} \bigl( \bar{x}^{i}_{k-1},\xi ^{i}_{k} \bigr)+g^{i}_{k-1} \bigr). \end{aligned}$$
(4)

The second term \(\hat{\nabla}h_{i}(\bar{x}^{i}_{k},\xi ^{i}_{k})-\hat{\nabla}h_{i}(\bar{x}^{i}_{k-1}, \xi ^{i}_{k})+g^{i}_{k-1}\) plays an important role in reducing the variance caused by gradient estimation. In addition, the recursive momentum technique allows Algorithm 1 to converge with only one sample per iteration, unlike the algorithms in [4] and [22], which require large batches. Hence, Algorithm 1 is also well suited to large-scale finite-sum optimization problems.
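As a toy illustration of why recursive momentum reduces variance, fix the iterate: then the correction term vanishes and (4) reduces to a weighted running average of noisy gradient samples. The sketch below is our own construction; the noisy oracle is a stand-in for the zeroth-order estimate.

```python
import numpy as np

# Recursive momentum at a FIXED iterate: the difference term in (4) is zero,
# so the update reduces to g_k = beta_k * sample + (1 - beta_k) * g_{k-1}.
rng = np.random.default_rng(1)
n, T = 5, 300
true_grad = np.arange(1.0, n + 1)            # stands in for the exact gradient

def noisy_sample():                          # stands in for one zeroth-order estimate
    return true_grad + rng.standard_normal(n)

g = noisy_sample()
plain_err, mom_err = [], []
for k in range(2, T + 1):
    beta = 2.0 / (k + 1)
    sample = noisy_sample()
    g = beta * sample + (1.0 - beta) * g     # momentum average of the samples
    plain_err.append(np.linalg.norm(sample - true_grad))
    mom_err.append(np.linalg.norm(g - true_grad))
```

With \(\beta _{k}=2/(k+1)\), the estimate averages past samples with linearly growing weights, so its error decays roughly like \(\mathcal{O}(1/\sqrt{k})\), while the error of a single sample stays constant.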

Remark 3

In Algorithm 1, the FW step ((6)–(7)) circumvents the projection operation by solving a linear optimization subproblem (6) over the constraint set \(\mathcal{X}\). For structured constraint sets such as nuclear and \(l_{1}\) norm balls, (6) admits an efficient implementation or even a closed-form solution [15], resulting in a cheaper computational cost than the projection step. For example, if \(\mathcal{X}\) is an \(l_{1}\) norm ball (\(\mathcal{X}:=\{x|\|x\|_{1} \leqslant d\}\)), the FW step admits the closed-form solution \(z^{i}_{k}=d\cdot [0,\ldots ,0,-\operatorname{sgn}[s^{i}_{k}]_{h},0,\ldots ,0]^{ \mathrm{T}}\) with \(h=\operatorname{argmax}_{j}|[s^{i}_{k}]_{j}|\) in Algorithm 1. Moreover, when \(\mathcal{X}\) is a nuclear norm ball, solving (6) requires computing only a single pair of singular vectors corresponding to the largest singular value, whereas projecting onto \(\mathcal{X}\) demands a full SVD.
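The closed-form \(l_{1}\)-ball oracle above can be sketched as follows (the helper name is ours); a brute-force check over the \(2n\) ball vertices confirms the chosen vertex is optimal:

```python
import numpy as np

def lmo_l1(s, d):
    """LM oracle over {x : ||x||_1 <= d}: the minimizer of <s, z> is the
    vertex -d*sgn([s]_h)*e_h with h = argmax_j |[s]_j|, as in Remark 3."""
    h = int(np.argmax(np.abs(s)))
    z = np.zeros_like(s)
    z[h] = -d * np.sign(s[h])
    return z

s = np.array([0.3, -2.0, 1.1, 0.0])
z = lmo_l1(s, d=5.0)     # puts +5 in coordinate 1, since |s_1| = 2 is largest
```

The oracle costs \(\mathcal{O}(n)\), in contrast to the \(\mathcal{O}(n\log n)\) sorting step typically needed for an exact \(l_{1}\)-ball projection.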

3 Assumptions and convergence analysis

This section is dedicated to analyzing the convergence performance of Algorithm 1. Before presenting the main results, we state several standard assumptions.

3.1 Assumptions and facts

Assumption 1

The network \(\mathcal{G}\) is connected.

Assumption 2

The weighted adjacency matrix W is doubly stochastic.

Assumptions 1 and 2 indicate that in each round of Step 1 in Algorithm 1, each agent takes a weighted average of the values of its neighbors according to W. In addition, these assumptions imply [26] that the second largest eigenvalue modulus λ of W satisfies \(|\lambda |<1\). The following fact holds under Assumptions 1 and 2 [26].

Fact 1

Let \(\bar{x}=\frac{1}{N} \sum_{i=1}^{N} x^{i}\) and \(\bar{x}^{i}=\sum_{j=1}^{N} w_{ij}x^{j}\). Then, \((\sum_{i=1}^{N} \|\bar{x}^{i} - \bar{x}\|^{2} )^{ \frac{1}{2}}\leqslant |\lambda | ( \sum_{i=1}^{N} \|x^{i} - \bar{x}\|^{2} )^{\frac{1}{2}}\).
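Fact 1 can be checked numerically. The snippet below (our own construction) builds a doubly stochastic W for a 5-agent ring, as used later in Sect. 4, and verifies the contraction on random iterates:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 5
W = np.zeros((N, N))                 # ring topology, maximum-degree weights
for i in range(N):
    W[i, i] = W[i, (i + 1) % N] = W[i, (i - 1) % N] = 1.0 / 3.0

# Second largest eigenvalue modulus (W is symmetric here, so eigvalsh applies).
lam = sorted(np.abs(np.linalg.eigvalsh(W)))[-2]

X = rng.standard_normal((N, 3))      # one 3-dimensional iterate per agent
xbar = X.mean(axis=0)
lhs = np.sqrt(np.sum(np.linalg.norm(W @ X - xbar, axis=1) ** 2))
rhs = lam * np.sqrt(np.sum(np.linalg.norm(X - xbar, axis=1) ** 2))
```

Since W is doubly stochastic, averaging preserves the mean \(\bar{x}\) while shrinking the deviation from it by a factor of at most \(|\lambda |\).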

Fact 1 suggests that each update of the average consensus process (Step 1) brings the iterates incrementally closer to their mean. To streamline the convergence analysis, we introduce \(k_{0}\in \mathbb{R}_{+}\) as the smallest integer such that \(|\lambda |\leqslant [k_{0}/(k_{0}+1)]^{2}\); explicitly, \(k_{0}=\lceil (|\lambda |^{-\frac{1}{2}}-1)^{-1}\rceil \).

Assumption 3

\(H_{i}(\cdot )\) and \(h_{i}(\cdot ,\xi ^{i})\) are L-smooth functions on the constraint set \(\mathcal{X}\) for all \(i\in \mathcal{N}\) and \(\xi ^{i}\in \mathbb{R}^{p}\).

Furthermore, we posit an additional assumption regarding the constraint set \(\mathcal{X}\), which forms a foundational element in the context of FW-based methods [3, 4, 14, 22].

Assumption 4

\(\mathcal{X}\) is compact and convex; in particular, its diameter is bounded by a positive constant d, i.e., \(\|x-y\|\leqslant d\) for all \(x,y\in \mathcal{X}\).

Assumption 5

The variance of \(\nabla h_{i}(x,\xi ^{i})\) is bounded for all \(x\in \mathcal{X}\) and \(i\in \mathcal{N}\). That is, there exists a constant δ such that \(\mathbb{E}[\|\nabla h_{i}(x,\xi ^{i})-\nabla H_{i}(x)\|^{2}] \leqslant \delta ^{2}\), where \(H_{i}(x)=\mathbb{E}[h_{i}(x,\xi ^{i})]\).

Fact 2

(see [13])

If Assumptions 4–5 hold, there is a positive constant l such that \(\mathbb{E}[\|\nabla h_{i}(x,\xi ^{i})\|^{2}]\leqslant l^{2}\) and \(\mathbb{E}[\|\nabla h_{i}(x, \xi ^{i})\|]\leqslant l\).

Assumptions 3–5 are standard assumptions in stochastic FW methods [3, 4, 9, 13, 14, 22]. If Assumption 3 holds, the following fact is true.

Fact 3

Define \(\hat{\nabla}H_{i}(x^{i}):=\sum^{n}_{j=1} \frac{H_{i}(x^{i}+\rho e_{j})-H_{i}(x^{i}-\rho e_{j})}{2\rho}e_{j}= \mathbb{E}[{\hat{\nabla}h_{i}(x^{i},\xi ^{i})}]\), where \(\hat{\nabla}h_{i}(x^{i},\xi ^{i})\) is defined in (2). Then, for any \(x^{i}\in \mathcal{X}\) (\(i\in \mathcal{N}\)) and \(\xi ^{i}\in \mathbb{R}^{p}\),

$$\begin{aligned} & \bigl\Vert \hat{\nabla}h_{i} \bigl(x^{i}, \xi ^{i} \bigr)-\nabla h_{i} \bigl(x^{i},\xi ^{i} \bigr) \bigr\Vert ^{2} \leqslant nL^{2}\rho ^{2}, \end{aligned}$$
(5)
$$\begin{aligned} & \bigl\Vert \hat{\nabla}H_{i} \bigl(x^{i} \bigr)-\nabla H_{i} \bigl(x^{i} \bigr) \bigr\Vert ^{2}\leqslant nL^{2} \rho ^{2}. \end{aligned}$$
(6)

Proof

We first prove (5). It follows from the definition of \(\hat{\nabla}h_{i}(x^{i},\xi ^{i})\) and the mean value theorem applied to \(h_{i}(\cdot ,\xi ^{i})\) along each coordinate direction that there exists \(\alpha _{j}\in (0,1)\) such that

$$\begin{aligned} & \bigl\Vert \hat{\nabla}h_{i} \bigl(x^{i},\xi ^{i} \bigr)-\nabla h_{i} \bigl(x^{i},\xi ^{i} \bigr) \bigr\Vert ^{2} \\ &\quad = \Biggl\Vert \sum_{j=1}^{n} \frac{h_{i}(x^{i}+\rho e_{j},\xi ^{i})-h_{i}(x^{i}-\rho e_{j},\xi ^{i})}{2\rho}e_{j} \\ &\qquad {}- \nabla h_{i} \bigl(x^{i}, \xi ^{i} \bigr) \Biggr\Vert ^{2} \\ &\quad = \Biggl\Vert \frac{1}{2\rho}\sum_{j=1}^{n} \bigl(2\rho e_{j}e^{\mathrm{T}}_{j} \nabla h_{i} \bigl(x^{i}+(2\alpha _{j}-1)\rho e_{j},\xi ^{i} \bigr) \bigr) \\ &\qquad {}-\nabla h_{i} \bigl(x^{i}, \xi ^{i} \bigr) \Biggr\Vert ^{2}. \end{aligned}$$

It follows from the property of the basis vector \(e_{j}\) and Euclidean norm that

$$\begin{aligned} & \bigl\Vert \hat{\nabla}h_{i} \bigl(x^{i},\xi ^{i} \bigr)-\nabla h_{i} \bigl(x^{i},\xi ^{i} \bigr) \bigr\Vert ^{2} \\ &\quad =\sum_{j=1}^{n} \bigl\Vert e_{j}e^{\mathrm{T}}_{j} \bigl(\nabla h_{i} \bigl(x^{i}+(2 \alpha _{j}-1)\rho e_{j},\xi ^{i} \bigr)-\nabla h_{i} \bigl(x^{i},\xi ^{i} \bigr) \bigr) \bigr\Vert ^{2} \\ &\quad \leqslant \sum_{j=1}^{n} \bigl\Vert \nabla h_{i} \bigl(x^{i}+(2\alpha _{j}-1) \rho e_{j},\xi ^{i} \bigr)-\nabla h_{i} \bigl(x^{i},\xi ^{i} \bigr) \bigr\Vert ^{2} \\ &\quad \leqslant L^{2}\sum_{j=1}^{n} \bigl\Vert (2\alpha _{j}-1)\rho e_{j} \bigr\Vert ^{2} \\ &\quad \leqslant nL^{2}\rho ^{2}, \end{aligned}$$

where the second inequality uses Assumption 3 (L-smoothness). Equation (6) follows in a similar way. □

Fact 4

(see [13])

For any vectors \(v_{1},\ldots , v_{N}\in \mathbb{R}^{n}\),

$$\begin{aligned} \Vert v_{1} + \cdots + v_{N} \Vert ^{2}\leqslant N \bigl( \Vert v_{1} \Vert ^{2} + \cdots + \Vert v_{N} \Vert ^{2} \bigr). \end{aligned}$$
(7)

Assumptions 1–5 and Facts 2–4 are crucial to the subsequent analysis: they provide the theoretical groundwork for the methodologies employed and the conclusions drawn.

3.2 Convergence analysis

For the convenience of analysis, we define

$$\begin{aligned}& \bar{x}_{k}:=\frac{1}{N}\sum_{i=1}^{N}x^{i}_{k},\qquad \bar{g}_{k}:= \frac{1}{N}\sum_{i=1}^{N}g^{i}_{k}, \\& \bar{p}_{k}:=\frac{1}{N}\sum_{i=1}^{N} \nabla H_{i} \bigl(\bar{x}^{i}_{k} \bigr). \end{aligned}$$

The following lemma estimates the tracking error for the average iterate in Algorithm 1, and we provide the proof in Appendix 1.2.

Lemma 1

Let \(\gamma _{k}=\frac{2}{k+2}\). If Assumptions 1, 2 and 4 hold, then, for any \(i\in \mathcal{N}\) and \(k\geqslant 1\), \(\|\bar{x}^{i}_{k}-\bar{x}_{k}\|\leqslant \frac{2C_{1}}{k+2}\) and \(\|\bar{x}^{i}_{k+1}-\bar{x}^{i}_{k}\|\leqslant \frac{2(d+2C_{1})}{k+2}\), where \(C_{1}\) is defined in Table 2.

Table 2 The nomenclature of values employed in this article

Lemma 1 shows that the averaged iterate estimate \(\bar{x}^{i}_{k}\) approximates the true average \(\bar{x}_{k}\) at a rate of \(\mathcal{O}(1/k)\).

We provide the performance of the averaged gradient tracking for Algorithm 1 in the following lemma. Appendix 1.4 presents the proof of Lemma 2.

Lemma 2

Suppose Assumptions 1–5 hold. If \(\beta _{k}=\frac{2}{k+1}\), \(\gamma _{k}=\frac{2}{k+2}\) and \(0<\rho _{k}\leqslant \frac{d+2C_{1}}{\sqrt{n}(k+2)}\), then

$$\begin{aligned} \mathbb{E} \bigl[ \bigl\Vert \bar{g}_{k}-s^{i}_{k} \bigr\Vert ^{2} \bigr]\leqslant \frac{4C_{2}}{(k+2)^{2}}, \end{aligned}$$
(8)

where \(C_{2}\) is defined in Table 2 and \(k\geqslant 1\).

Lemma 2 establishes that \(\mathbb{E}[\|\bar{g}_{k}-s^{i}_{k}\|^{2}]=\mathcal{O}(1/k^{2})\), which implies that \(\|\bar{g}_{k}-s^{i}_{k}\|\) converges to zero as \(k\rightarrow +\infty \) in expectation.

The following lemma plays an important role in the convergence analysis of Algorithm 1.

Lemma 3

Define \(\hat{\nabla}\bar{h}_{k}:=\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}_{k}[ \hat{\nabla}h_{i}(\bar{x}^{i}_{k},\xi ^{i}_{k})]\). If Assumptions 1–5 hold, the following two relations hold.

1) For any \(k\geqslant 1\), it holds that

$$\begin{aligned} &\mathbb{E} \bigl[ \Vert \bar{g}_{k}-\hat{\nabla}\bar{h}_{k} \Vert ^{2} \bigr] \\ &\quad \leqslant (1- \beta _{k})^{2}\mathbb{E} \bigl[ \Vert \bar{g}_{k-1}-\hat{\nabla}\bar{h}_{k-1} \Vert ^{2} \bigr]+60nL^{2} \rho ^{2}_{k-1} \\ &\qquad {} +6\delta ^{2}\beta _{k}^{2}+24L^{2} \gamma ^{2}_{k-1}(d+2C_{1})^{2}. \end{aligned}$$
(9)

2) If \(\beta _{k}=\frac{2}{k+1}\), \(\gamma _{k}=\frac{2}{k+2}\), and \(\rho _{k}\leqslant \frac{d+2C_{1}}{\sqrt{n}(k+2)}\), then for any \(k\geqslant 1\),

$$\begin{aligned} &\mathbb{E} \bigl[ \Vert \bar{g}_{k}-\bar{p}_{k} \Vert ^{2} \bigr]\leqslant \frac{2C_{3}+2L^{2}(d+2C_{1})^{2}}{k+2}, \end{aligned}$$
(10)

where \(C_{3}\) and \(C_{1}\) are defined in Table 2.

The proof of Lemma 3 is provided in Appendix 1.5.

Lemma 3 shows that the variable \(\bar{g}_{k}\) tracks the real average gradient \(\bar{p}_{k}\) with an expected squared error bounded by \(\mathcal{O}(\frac{C_{3}+L^{2}(d+2C_{1})^{2}}{k+2})\). That is, the expected error of the stochastic gradient approximation diminishes as the number of iterations increases. Making use of Lemmas 2 and 3, the following lemma is established.

Lemma 4

Choose \(\beta _{k}=\frac{2}{k+1}\), \(\gamma _{k}=\frac{2}{k+2}\), and \(0<\rho _{k}\leqslant \frac{d+2C_{1}}{\sqrt{n}(k+2)}\). If Assumptions 1–5 hold, then, for any \(k\geqslant 1\) and \(i\in \mathcal{N}\),

$$\begin{aligned} &\mathbb{E} \bigl[ \bigl\Vert \nabla h(\bar{x}_{k})-s^{i}_{k} \bigr\Vert ^{2} \bigr] \\ &\quad \leqslant \frac{18L^{2}(d+2C_{1})^{2}+12C_{2}+6C_{3}}{k+2}. \end{aligned}$$
(11)

The proof is presented in Appendix 1.6.

The following two theorems establish convergence rates of Algorithm 1 for convex and nonconvex objectives, respectively.

Theorem 1

(Convex objective) Let Assumptions 1–5 hold. Choose \(\beta _{k}=\frac{2}{k+1}\), \(\gamma _{k}=\frac{2}{k+2}\), and \(0<\rho _{k}\leqslant \frac{d+2C_{1}}{\sqrt{n}(k+2)}\). If \(h_{i}(\cdot ,\xi ^{i})\) is convex for any \(i\in \mathcal{N}\) and \(\xi ^{i}\), then

$$\begin{aligned} \mathbb{E} \bigl[h(\bar{x}_{k+1}) \bigr]-h \bigl(x^{*} \bigr) \leqslant \frac{C_{4}}{(k+3)^{\frac{1}{2}}},\quad \forall k\geqslant 1, \end{aligned}$$

where \(C_{4}\) is defined in Table 2.

The proof of Theorem 1 is presented in Appendix 1.7.

Theorem 1 indicates that Algorithm 1 converges at a rate of \(\mathcal{O}(1/k^{\frac{1}{2}})\). The result translates directly into finding an ϵ-optimal solution to problem (1): the numbers of calls to the SZO and the LMO required are \(\mathcal{O}(\frac{nC^{2}_{4}}{\epsilon ^{2}})\) and \(\mathcal{O}(\frac{C^{2}_{4}}{\epsilon ^{2}})\), respectively.

For the nonconvex case, we adopt a convergence criterion standard for FW methods, aka the FW-gap [4, 13, 14, 22], which is

$$\begin{aligned} p_{k}=\max_{x\in \mathcal{X}} \bigl\langle \nabla h( \bar{x}_{k}),\bar{x}_{k}-x \bigr\rangle . \end{aligned}$$
(12)
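For structured sets, the inner maximization in (12) is itself an LM problem and can often be evaluated in closed form; e.g., over the \(l_{1}\) ball \(\{x:\|x\|_{1}\leqslant d\}\) one gets \(p_{k}=\langle \nabla h(\bar{x}_{k}),\bar{x}_{k}\rangle +d\|\nabla h(\bar{x}_{k})\|_{\infty}\). A short sketch (the helper name is ours):

```python
import numpy as np

def fw_gap_l1(grad, xbar, d):
    """FW-gap (12) over the l1 ball {x : ||x||_1 <= d}:
    max_x <grad, xbar - x> = <grad, xbar> + d * ||grad||_inf."""
    return float(grad @ xbar + d * np.max(np.abs(grad)))

grad = np.array([0.5, -1.5, 0.2])
xbar = np.array([1.0, 0.0, -0.5])
gap = fw_gap_l1(grad, xbar, d=5.0)
```

The closed form follows because a linear function over the \(l_{1}\) ball attains its extremum at a vertex \(\pm d\,e_{j}\).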

Based on the convergence measure (12), we establish the following theorem for problem (1) with nonconvex objective functions.

Theorem 2

(Nonconvex objective) Suppose Assumptions 1–5 hold. Choose \(\beta _{k}=\frac{2}{k+1}\), \(\gamma _{k}=\frac{2}{k+2}\), and \(0<\rho _{k}\leqslant \frac{d+2C_{1}}{\sqrt{n}(k+2)}\). Then,

$$\begin{aligned} \mathbb{E} \Bigl[\min_{k\in \{1,\ldots ,K\}}p_{k} \Bigr]\leqslant{}& \frac{1}{\log _{2}(K)-1} \bigl(h(\bar{x}_{1})-h( \bar{x}_{K+1})+4Ld \\ &{} +c\sqrt{18L^{2}(d+2C_{1})^{2}+12C_{2}+6C_{3}} \bigr), \end{aligned}$$

where \(c\in \mathbb{R}_{+}\) satisfies \(\sum_{k=1}^{2^{m}}(4d/(k+2)^{\frac{3}{2}})\leqslant c\) with \(K=2^{m}\); such a finite c exists because the series converges.

The proof of Theorem 2 is presented in Appendix 1.8.

Theorem 2 shows that Algorithm 1 converges to a stationary point at a rate of \(\mathcal{O}(1/\log _{2}(K))\) when the objective function is nonconvex. The total numbers of calls to the SZO and the LMO for finding an ϵ-stationary point are \(\mathcal{O}(2^{\frac{\Gamma}{\epsilon}}d)\) and \(\mathcal{O}(2^{\frac{\Gamma}{\epsilon}})\), respectively.

Remark 4

Table 1 shows that both the number of calls to the SZO and the per-query batch size of Algorithm 1 are significantly smaller than those of ZSCG and ZSAGMIU [4], at the cost of a larger complexity bound on the LMO. In addition, Algorithm 1 has the same complexity bounds for both the SZO and the LMO as the recently proposed centralized method MOST-FW [3]. Compared with the existing decentralized zeroth-order FW method DSGFF [22], which requires a central coordinator, the fully distributed Algorithm 1 has a lower SZO complexity bound in the convex case and a weaker dimension dependence of the SZO in the nonconvex case.

Remark 5

It is worth noting that our step sizes are monotonically decreasing, unlike those in existing zeroth-order nonconvex FW methods [4, 14, 22], which depend on the total number of iterations K and the dimension of the variable.

4 Numerical simulations

In this section, we apply Algorithm 1 (DSZO-FW) to black-box distributed stochastic binary classification problems with convex and nonconvex objectives, respectively. DSZO-FW runs over a connected network \(\mathcal{G}\) with \(N=5\) agents and a doubly stochastic adjacency matrix W. The communication graph is a ring topology, and each agent accesses only its own objective function \(h_{i}\). We construct W using maximum-degree weights: the maximum degree of the ring is \(d_{\max}=2\), so for any edge \((i,j)\) with \(i\neq j\) the weight is \(w_{ij}=1/(1+d_{\max})\), and the diagonal elements are set to \(w_{ii}=1-\sum_{j\in \mathcal{N}_{i}}w_{ij}\), where \(\mathcal{N}_{i}\) denotes the set of neighbors of node i, so that each row sums to 1. We set the constraint set to the \(l_{1}\)-norm ball \(\mathcal{X}=\{x|\|x\|_{1}\leqslant d\}\) with \(d=5\).
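The weight construction described above can be sketched as follows (the helper name is ours):

```python
import numpy as np

def max_degree_weights(adj):
    """Maximum-degree weights from a 0/1 adjacency matrix with no self-loops:
    w_ij = 1/(1 + d_max) on edges; diagonal entries fill each row to sum 1."""
    N = adj.shape[0]
    d_max = int(adj.sum(axis=1).max())
    W = adj / (1.0 + d_max)
    W[np.arange(N), np.arange(N)] = 1.0 - W.sum(axis=1)
    return W

# Ring of N = 5 agents as in the experiments (d_max = 2, edge weights 1/3).
N = 5
adj = np.zeros((N, N))
for i in range(N):
    adj[i, (i + 1) % N] = adj[i, (i - 1) % N] = 1.0
W = max_degree_weights(adj)
```

For an undirected graph with uniform edge weights, the resulting W is symmetric, hence doubly stochastic as required by Assumption 2.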

To better evaluate the performance of DSZO-FW, we compare it against the centralized algorithms ZSCG [4], SGFFW [23], and MOST-FW [3] as baselines. In the experiments, we use three public datasets (covtype.binary, a9a and w8a) and let each iteration randomly sample only 1% of the data. Because the large batch size \(m_{k}\) required by ZSCG (related to the dimension and the total number of iterations) exceeds the total number of samples in these three datasets, we run ZSCG as a deterministic algorithm that uses the full data to compute function values. We evaluate the four algorithms by the FW-gap defined in (12).

4.1 Black-box binary classification with convex objectives

This subsection verifies the theoretical results of DSZO-FW in the convex case. The goal is to find an optimal solution \(x\in \mathbb{R}^{n}\) of the following stochastic binary classification problem:

$$\begin{aligned} &\min_{x\in \mathcal{X}}h(x),\quad h(x):= \frac{1}{N}\sum _{i=1}^{N}h_{i}(x), \\ &h_{i}(x):=\frac{1}{m_{i}}\sum_{j=1}^{m_{i}} \mathbb{E}_{a_{ij},b_{ij}} \bigl[\mathrm{ln} \bigl(1+\operatorname{exp} \bigl(-b_{ij}\langle a_{ij},x\rangle \bigr) \bigr) \bigr], \end{aligned}$$

where \((a_{ij},b_{ij})^{m_{i}}_{j=1}\) are the \(m_{i}\) (feature, label) pairs randomly drawn by agent i from the dataset. For a fair comparison, we set the step sizes of the four algorithms to the values prescribed by their theoretical results in the convex setting, i.e., \(\alpha _{k}=6/(k+5)\) for ZSCG [4]; \(\rho _{k}=4/(k+8)^{\frac{2}{3}}\), \(\gamma _{k}=2/(k+8)\) and \(c_{k}=2/(n^{\frac{1}{2}}(k+8)^{\frac{1}{3}})\) for SGFFW [23]; \(\gamma _{k}=1/k\), \(\eta _{k}=2/(k+1)\), \(\mu _{k}=0\) and \(\rho _{k}=d/\sqrt{n}(k+1)\) for MOST-FW [3]; \(\beta _{k}=2/(k+1)\), \(\gamma _{k}=2/(k+2)\) and \(\rho _{k}=d/\sqrt{n}(k+2)\) for DSZO-FW.

Figure 1 shows the convergence performance of the four algorithms on the convex binary classification problem. DSZO-FW and MOST-FW achieve a smaller FW-gap than ZSCG and SGFFW, especially on dataset \(w8a\), even though they use less data than ZSCG. This indicates that the local gradient estimate based on the recursive momentum technique may be a better candidate for approximating the gradient. We also observe periodic oscillations in the curves of all four algorithms, especially on datasets \(a9a\) and \(w8a\). We attribute this phenomenon to the imprecision of the gradient estimator and the periodic nature of the variance reduction.

Figure 1
figure 1

The comparison between ZSCG, MOST-FW, SGFFW and DSZO-FW on convex black-box binary classification. (a) a9a (b) w8a (c) covtype.binary

4.2 Black-box binary classification with nonconvex objectives

In this subsection, we verify the theoretical results of DSZO-FW in the nonconvex case. Consider the following stochastic binary classification problem with nonconvex objective functions:

$$\begin{aligned} &\min_{x\in \mathcal{X}}h(x),\quad h(x):= \frac{1}{N}\sum _{i=1}^{N}h_{i}(x), \\ &h_{i}(x):=\frac{1}{m_{i}}\sum_{j=1}^{m_{i}} \mathbb{E}_{a_{ij},b_{ij}} \biggl[\frac{1}{1+\operatorname{exp}(b_{ij}\langle a_{ij},x\rangle )} \biggr], \end{aligned}$$

where \((a_{ij},b_{ij})^{m_{i}}_{j=1}\) are the \(m_{i}\) (feature, label) pairs randomly drawn by agent i from the dataset. For a fair comparison, we set the step sizes of the four algorithms to the values prescribed by their theoretical results in the nonconvex setting, i.e., \(\alpha _{k}=1/T^{\frac{1}{2}}\) for ZSCG [4]; \(\gamma _{k}=1/T^{\frac{3}{4}}\), \(\rho _{k}=4/((k+8)^{\frac{2}{3}}(1+n)^{\frac{1}{3}})\), and \(c_{k}=2/(n^{\frac{3}{2}}(k+8)^{\frac{1}{3}})\) for SGFFW [23]; \(\gamma _{k}=1/k\), \(\eta _{k}=2/(k+1)\), \(\mu _{k}=0\), and \(\rho _{k}=d/\sqrt{n}(k+1)\) for MOST-FW [3]; \(\beta _{k}=2/(k+1)\), \(\rho _{k}=d/\sqrt{n}(k+2)\), and \(\gamma _{k}=2/(k+2)\) for DSZO-FW. Note that MOST-FW has no convergence guarantee in the nonconvex case; we implement it only for comparison.

Figure 2 shows the convergence performance, measured by the FW-gap, of the four algorithms on the nonconvex binary classification problem. The results show that DSZO-FW converges faster than ZSCG and SGFFW on all three datasets, and has convergence performance comparable to MOST-FW on datasets \(a9a\) and \(w8a\), demonstrating the efficacy of the variance reduction technique used in DSZO-FW and MOST-FW. As in Fig. 1, periodic oscillations appear in the curves of all four algorithms, especially on datasets \(a9a\) and \(w8a\); we infer that the variance of the gradient estimator is too high in these two cases.

Figure 2
figure 2

The comparison between ZSCG, MOST-FW, SGFFW and DSZO-FW on nonconvex black-box binary classification. (a) a9a (b) w8a (c) covtype.binary

5 Conclusions

This paper proposed a novel projection-free and gradient-free algorithm for distributed stochastic optimization problems with access only to a stochastic zeroth-order oracle (SZO). The proposed algorithm requires only a single batch per iteration to guarantee convergence, thanks to the recursive momentum and gradient tracking techniques. We proved that the proposed algorithm attains a complexity bound of \(\mathcal{O}(n/\epsilon ^{2})\) on the SZO for the convex case, comparable to the best centralized results. For the nonconvex case, the algorithm has a complexity bound of \(\mathcal{O}(n\cdot 2^{\frac{1}{\epsilon}})\) on the SZO under mild conditions. The efficacy of the proposed algorithm was demonstrated through simulation experiments on multiple datasets. Future work includes extending the algorithm to stochastic nonsmooth optimization problems and introducing further variance reduction techniques to obtain better convergence performance.