1 Introduction

In recent years, distributed optimization has attracted a surge of interest in diverse areas, including autonomous vehicle control [16], multi-agent systems [31], and sensor networks [1], owing to its advantages in data privacy, robustness, flexibility, and scalability. Distributed optimization minimizes a joint objective through local computation and communication between agents in a network. Recently, much effort has been devoted to the distributed stochastic setting [11, 19, 29, 30], where each agent’s objective function is the expectation of a function of random variables that follow unknown distributions. Such problems arise widely in machine learning [5, 19], multi-agent reinforcement learning [25, 27, 28], and unmanned systems [7, 31], to name a few. Most distributed algorithms for solving such problems require explicit gradients of the objective functions. In many practical applications, however, the feedback available to agents is incomplete or noisy because of environmental uncertainty, so assuming exact gradient feedback is often too restrictive in practice.

Zeroth-order optimization is a typical gradient-free approach that has attracted widespread attention due to its use in many practical large-scale optimization tasks. In these tasks, the explicit gradient of the objective function is expensive or impossible to obtain, and only function evaluations are accessible. For instance, the objective functions of many big-data problems with complex data-generation processes cannot be defined explicitly. Such situations include large-scale black-box adversarial attacks on deep networks [8], simulation-based modeling [20], and reinforcement learning [24], among others. Motivated by these applications, the design and analysis of zeroth-order algorithms have become increasingly popular, including distributed zeroth-order algorithms [21, 32, 34, 35] and stochastic zeroth-order algorithms [33, 36]. Nevertheless, most zeroth-order algorithms, even in centralized settings, are designed for unconstrained optimization problems or depend on projection operators to handle constraint sets. Projection operations may incur an undesirable computational burden and can even become computationally prohibitive for structured constraints such as latent group Lasso balls, \(l_{1}\) norm balls, and nuclear norm balls [15].

Consequently, the Frank-Wolfe (FW) method [10], also known as the conditional gradient method, has resurged because of its projection-free and computationally efficient nature. The FW method avoids the projection step by accessing a linear minimization (LM) oracle, which can be implemented efficiently, especially for some widespread structured constraints (see Table I in [15]). For instance, solving an LM problem over a nuclear norm ball only requires computing a single pair of singular vectors corresponding to the largest singular value, whereas projecting a point onto a nuclear norm ball demands a full SVD. Recent years have witnessed extensive research on FW algorithms in both the centralized stochastic setting [2, 12, 18] and the distributed deterministic setting [5, 6, 17]. Note that the aforementioned FW algorithms are all built on first-order gradients and therefore cannot be applied directly to problems where only objective function values are accessible.

The FW method with a stochastic zeroth-order oracle (SZO) has recently been investigated in both convex and nonconvex settings. Specifically, [4] put forth zeroth-order stochastic FW algorithms with complexity bounds \(\mathcal{O}(n/\epsilon ^{2})\) and \(\mathcal{O}(n/\epsilon ^{4})\) on the SZO for convex and nonconvex cases, respectively. However, the algorithms in [4] require a mini-batch size tied to the total number of iterations and the problem dimension to guarantee convergence. Further, [14] relaxed the conditions on batch sizes via the variance-reduction technique SPIDER, and showed that the resulting algorithm achieves a lower complexity bound of \(\mathcal{O}(n/\epsilon ^{3})\) on the SZO in the nonconvex setting. For the convex case, [3] put forth a stochastic zeroth-order FW method that requires only a single batch per iteration thanks to a momentum-based gradient tracking technique, and obtained a complexity bound of \(\mathcal{O}(n/\epsilon ^{2})\) on the SZO. Subsequently, [22] extended centralized stochastic zeroth-order FW methods to a decentralized setting that still depends on a central coordinator, and showed that the proposed algorithm has complexity bounds \(\mathcal{O}(n/\epsilon ^{3})\) and \(\mathcal{O}(n^{\frac{4}{3}}/\epsilon ^{4})\) on the SZO for convex and nonconvex cases, respectively. To date, however, there are no efficient zeroth-order FW methods for solving distributed stochastic optimization (DSO) problems in either convex or nonconvex settings.

Motivated by the above discussion, this paper is dedicated to designing a novel distributed projection-free and gradient-free algorithm for DSO problems. We provide a rigorous theoretical analysis of the convergence rate and complexity guarantees of the proposed algorithm, which enjoys a convergence rate comparable to that of centralized stochastic first-order optimization algorithms [13], filling a theoretical gap for zeroth-order FW methods in DSO problems. Table 1 compares the proposed algorithm with related methods. The main contributions of our work are as follows.

  • We put forth a Distributed Stochastic Zeroth-Order Frank-Wolfe algorithm (DSZO-FW) that combines the gradient tracking technique, the momentum-based variance reduction technique, and coordinate-wise gradient estimation. To the best of our knowledge, DSZO-FW is the first zeroth-order FW algorithm for DSO problems.

  • We derive sufficient conditions to guarantee the convergence of DSZO-FW under mild assumptions. Specifically, DSZO-FW converges using only a single batch per iteration by introducing the recursive momentum technique [9]. We establish convergence rates of \(\mathcal{O}(k^{-\frac{1}{2}})\) and \(\mathcal{O}(1/\log _{2}(k))\) for the convex and nonconvex cases, respectively. The guarantee in the convex case matches the previous best-known result for centralized stochastic optimization methods.

  • For convex objective functions, we prove that DSZO-FW has a function query complexity of \(\mathcal{O}(n/\epsilon ^{2})\) for finding an ϵ-optimal solution, which matches the best existing centralized results [3, 4] and is even smaller than that of the recent decentralized FW method in [22].

  • For nonconvex objective functions, we show that DSZO-FW has a function query complexity of \(\mathcal{O}(n\cdot 2^{\frac{1}{\epsilon}})\) for finding an ϵ-stationary point under time-decaying step sizes. In contrast, other works [4, 14, 22] for such problems rely on step sizes that depend on the total number of iterations.

Table 1 Complexity bounds for Stochastic Frank-Wolfe Optimization method to find an ϵ-optimal or ϵ-stationary point

The remainder of this paper is structured as follows. We introduce the problem and the algorithm design in Sect. 2. The convergence performance and theoretical guarantees of the proposed algorithm are presented in Sect. 3. Section 4 presents several simulation experiments that validate the efficacy of the algorithm. Section 5 concludes the work. The Appendix provides the technical proofs of the paper.

Notations

The notations used in this paper are fairly standard. Specifically, we denote \(\mathbb{R}\) as the set of real numbers, and \(\mathbb{R}_{+}\) as the set of nonnegative real numbers. Symbols \(\langle \cdot \rangle \) and \(\lceil \cdot \rceil \) denote the inner product and the ceiling operation, respectively. In addition, \(\mathbb{R}^{p}\) is the set of p-dimensional real vectors. Consider a vector \(v\in \mathbb{R}^{p}\). We write \(\|v\|_{q}\) for the \(l_{q}\) norm of v and \(\|v\|\) for the Euclidean norm of v. We write \(\mathbb{E}[\cdot ]\) to denote the expectation operator; moreover, \(\mathbb{E}[\cdot |\mathcal{F}_{k}]\) represents the conditional expectation on the σ-field \(\mathcal{F}_{k}\). Finally, \(W=[w_{ij}]_{N\times N}\) is the weighted adjacency matrix of a topology graph \(\mathcal{G}(\mathcal{N},\mathcal{E})\), where \(\mathcal{N}=\{1,2,\ldots ,N\}\) is the set of N agents, and \(\mathcal{E}\subseteq \mathcal{N}\times \mathcal{N}\) is the set of edges. For any \(i,j\in \mathcal{N}\), if \((i,j)\in \mathcal{E}\), then \(w_{ij}>0\); otherwise \(w_{ij}=0\).

2 Problem statement and algorithm design

2.1 Problem statement

Consider a set of agents \(\mathcal{N}=\{1,2,\ldots ,N\}\) over an undirected network \(\mathcal{G}=\{\mathcal{N},\mathcal{E}\}\), where \(\mathcal{E}\subseteq \mathcal{N}\times \mathcal{N}\) is a set of edges. These agents aim to collaborate to find an optimal solution \(x^{*}\) of the problem

$$\begin{aligned} &\min_{ x\in \mathcal{X}}h(x), h(x):= \frac{1}{N}\sum _{i=1}^{N}\mathbb{E}_{\xi ^{i}} \bigl[h_{i} \bigl(x,\xi ^{i} \bigr) \bigr], \end{aligned}$$
(1)

where \(x\in \mathbb{R}^{n}\) is the strategy variable, and \(\mathcal{X}\subseteq \mathbb{R}^{n}\) is a compact and convex set. The function \(H_{i}(x):=\mathbb{E}_{\xi ^{i}}[h_{i}(x,\xi ^{i})]\) is a local objective function, and \(h_{i}:\mathcal{X}\times \mathbb{R}^{p}\rightarrow \mathbb{R}\) is a function involving a random variable \(\xi ^{i}\) with an unknown distribution. The randomness \(\xi ^{i}\) can be viewed as a random sample injected by the algorithm or as measurement noise inherent in the system. Here, we assume that the gradient of the objective function \(H_{i}(\cdot )\) is expensive or infeasible to obtain, and that agent \(i\in \mathcal{N}\) can only access a stochastic approximation \(h_{i}(x,\xi ^{i})\) of the true objective value for any given x and \(\xi ^{i}\).

2.2 Algorithm design

We propose a Distributed Stochastic Zeroth-Order Frank-Wolfe algorithm (DSZO-FW), which is summarized in Algorithm 1. To measure the convergence performance of DSZO-FW, we introduce the following two oracles and a performance measure.

  • Stochastic Zeroth-order Oracle (SZO): SZO returns a function value \(h_{i}(x,\xi ^{i})\) for given \(x\in \mathbb{R}^{n}\) and \(\xi ^{i}\in \mathbb{R}^{p}\).

  • Linear Minimization Oracle (LMO): LMO solves a linear optimization problem, and returns \(\operatorname{argmin}_{\phi \in \mathcal{X}}\langle s,\phi \rangle \) for given direction s and constraint set \(\mathcal{X}\).

  • ϵ-optimal solution: Let \(x^{*}\in \mathcal{X}\) be an optimal solution of problem (1). If \(h(x)-h(x^{*})\leqslant \epsilon \), then \(x\in \mathcal {X}\) is an ϵ-optimal solution of problem (1).

Algorithm 1
figure a

DSZO-FW

Due to the unavailability of the gradient information for objective functions, agent i estimates the gradient \(\nabla h_{i}(x^{i},\xi ^{i})\) by using a coordinate-wise gradient estimator [3, 14]:

$$\begin{aligned} \hat{\nabla}h_{i} \bigl(x^{i},\xi ^{i} \bigr) = \sum_{j=1}^{n} \frac{h_{i}(x^{i}+\rho e_{j},\xi ^{i})-h_{i}(x^{i}-\rho e_{j},\xi ^{i})}{2\rho}e_{j}, \end{aligned}$$
(2)

where \(\rho >0\) denotes the element-wise smoothing parameter, and \(e_{j}\in \mathbb{R}^{n}\) is the jth standard basis vector, i.e., \([e_{j}]_{l}=1\) if \(l=j\) and \([e_{j}]_{l}=0\) otherwise. At iteration k of Algorithm 1, the estimator (2) takes the following form:

$$\begin{aligned} &\hat{\nabla}h_{i} \bigl(x^{i}_{k}, \xi ^{i}_{k} \bigr) \\ &\quad = \sum_{j=1}^{n} \frac{h_{i}(x^{i}_{k} + \rho _{k} e_{j},\xi ^{i}_{k}) - h_{i}(x^{i}_{k} - \rho _{k} e_{j},\xi ^{i}_{k})}{2\rho _{k}}e_{j}, \end{aligned}$$
(3)

where \(\{\rho _{k}\}_{k=1}^{\infty}\) is a decreasing sequence of positive real numbers.
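As a concrete illustration, the estimator (2) can be sketched in a few lines of Python (the function name is ours, and the callable `h` is a stand-in for a generic sampled objective \(h_{i}(\cdot ,\xi ^{i})\)); for a quadratic objective the central difference recovers the gradient exactly:

```python
import numpy as np

def coord_grad_est(h, x, rho, xi):
    """Coordinate-wise central-difference estimator, cf. (2):
    sum_j [h(x + rho*e_j, xi) - h(x - rho*e_j, xi)] / (2*rho) * e_j.
    Each call costs 2n function queries."""
    n = x.size
    g = np.zeros(n)
    for j in range(n):
        e = np.zeros(n)
        e[j] = rho
        g[j] = (h(x + e, xi) - h(x - e, xi)) / (2.0 * rho)
    return g

# Sanity check on a smooth quadratic whose gradient is x + xi:
h = lambda x, xi: 0.5 * x @ x + xi @ x
x = np.array([1.0, -2.0, 3.0])
xi = np.array([0.1, 0.2, -0.3])
est = coord_grad_est(h, x, rho=1e-4, xi=xi)
```

For an L-smooth objective, the estimation error is of order \(nL^{2}\rho ^{2}\) (see Fact 3 below), which motivates the decreasing sequence \(\rho _{k}\).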

In Algorithm 1, each agent uses the SZO rather than gradient information and executes four main steps. Here, we briefly describe the kth iteration of agent i.

  • Step 1: Agent i takes a weighted average of values from its neighbors on the basis of W, and uses \(\bar{x}^{i}_{k}\) to approximate the average iterate. The specific description is provided in (2).

  • Step 2: Agent i estimates the gradient by using the coordinate-wise gradient estimator (3). To address the non-vanishing variance caused by the gradient estimation, we introduce a modified momentum-based variance reduction method, aka recursive momentum [9], into the distributed stochastic Frank-Wolfe (FW) update; see Remark 2 for the explicit expression.

  • Step 3: Agent i approximates the global gradient by using the gradient tracking technique, which reuses the global gradient estimation \(y^{i}_{k-1}\) from the previous iteration via (4) and (5).

  • Step 4: To avoid projection operations, agent i updates its iterate by first solving the linear minimization problem (6) to obtain a conditional gradient \(z^{i}_{k}\), and then taking a convex combination with the average iterate approximation \(\bar{x}^{i}_{k}\) in (7).
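The four steps above can be sketched in code. The following is a minimal, illustrative Python simulation of one possible realization of DSZO-FW on a toy quadratic problem over an \(l_{1}\) ball. The function names, the toy objective, and the particular form of the gradient-tracking line are our own assumptions; Algorithm 1's exact updates (2)–(7) appear in the algorithm figure and may differ in detail.

```python
import numpy as np

def lmo_l1(s, d):
    # Step 4 helper: argmin of <s, z> over {||z||_1 <= d} is a ball vertex
    h = int(np.argmax(np.abs(s)))
    z = np.zeros_like(s)
    z[h] = -d * np.sign(s[h])
    return z

def grad_est(h_i, x, xi, rho):
    # Coordinate-wise central-difference estimator, cf. (2)-(3)
    g = np.zeros_like(x)
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = rho
        g[j] = (h_i(x + e, xi) - h_i(x - e, xi)) / (2.0 * rho)
    return g

def dszo_fw(h_list, W, x0, d, K, rng):
    N, n = len(h_list), x0.size
    x = np.tile(x0, (N, 1))                 # row i holds agent i's iterate
    g = np.zeros((N, n))                    # recursive-momentum gradient estimates
    s = np.zeros((N, n))                    # gradient-tracking variables
    g_prev = np.zeros((N, n))
    xbar_prev = x.copy()
    for k in range(1, K + 1):
        beta, gamma = 2.0 / (k + 1), 2.0 / (k + 2)
        rho = d / (np.sqrt(n) * (k + 2))
        xbar = W @ x                        # Step 1: consensus averaging
        for i in range(N):
            xi = rng.standard_normal()      # one fresh sample per agent and iteration
            gh = grad_est(h_list[i], xbar[i], xi, rho)
            gh_prev = grad_est(h_list[i], xbar_prev[i], xi, rho)
            g[i] = beta * gh + (1.0 - beta) * (gh - gh_prev + g_prev[i])  # Step 2
        s = W @ s + g - g_prev              # Step 3: one common gradient-tracking form
        for i in range(N):
            z = lmo_l1(s[i], d)             # Step 4: LM oracle + convex combination
            x[i] = (1.0 - gamma) * xbar[i] + gamma * z
        xbar_prev, g_prev = xbar.copy(), g.copy()
    return x

# Toy instance: N = 4 agents, quadratic local losses with mean-zero noise.
rng = np.random.default_rng(0)
N, n, d = 4, 3, 5.0
targets = np.array([[1.0, 0.5, -0.5], [0.8, 0.2, 0.1],
                    [1.2, 0.4, -0.2], [0.9, 0.6, 0.0]])
h_list = [lambda x, xi, c=targets[i]: 0.5 * np.sum((x - c) ** 2) + 0.01 * xi * np.sum(x)
          for i in range(N)]
W = np.zeros((N, N))                        # ring with maximum-degree weights
for i in range(N):
    W[i, i] = W[i, (i + 1) % N] = W[i, (i - 1) % N] = 1.0 / 3.0
x_final = dszo_fw(h_list, W, np.zeros(n), d, 40, rng)
```

By construction, every iterate stays inside the \(l_{1}\) ball, since each update is a convex combination of a feasible point and a ball vertex.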

Remark 1

The use of zeroth-order gradients, also known as derivative-free optimization, brings both unique challenges and potential advantages. One main challenge of zeroth-order methods is that they require many more function evaluations than first-order methods, leading to higher gradient variance and computational cost. To address this issue, this paper incorporates the recursive momentum technique into a gradient-tracking distributed framework to reduce the non-vanishing variance caused by gradient estimation. Remarkably, the proposed distributed zeroth-order algorithm not only attenuates the noise in the gradient approximation using only a single batch, but also achieves a function query complexity comparable to the best existing centralized result in the convex case. The most significant advantage of zeroth-order gradients is the ability to optimize functions without gradient information, which makes the approach applicable to a wider range of problems where gradients are difficult or impossible to compute.

Remark 2

In Algorithm 1, we introduce the recursive momentum technique into the distributed zeroth-order FW method for reducing the variance caused by gradient estimates, as described in (3). Specifically, we rewrite (3) as

$$\begin{aligned} g^{i}_{k}={}&\beta _{k}\hat{\nabla}h_{i} \bigl(\bar{x}^{i}_{k},\xi ^{i}_{k} \bigr)+(1- \beta _{k}) \bigl(\hat{\nabla}h_{i} \bigl(\bar{x}^{i}_{k},\xi ^{i}_{k} \bigr) \\ &{}-\hat{\nabla}h_{i} \bigl( \bar{x}^{i}_{k-1},\xi ^{i}_{k} \bigr)+g^{i}_{k-1} \bigr). \end{aligned}$$
(4)

The second term \(\hat{\nabla}h_{i}(\bar{x}^{i}_{k},\xi ^{i}_{k})-\hat{\nabla}h_{i}(\bar{x}^{i}_{k-1}, \xi ^{i}_{k})+g^{i}_{k-1}\) plays an important role in reducing the variance caused by gradient estimation. In addition, the recursive momentum technique allows Algorithm 1 to converge with only one sample per iteration, unlike the algorithms in [4] and [22], which require large batches. Hence, Algorithm 1 is also well suited to large-scale finite-sum optimization problems.
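As a toy illustration of why recursive momentum reduces variance, fix the iterate: then the correction term vanishes and (4) reduces to a weighted running average of noisy gradient samples. The sketch below is our own construction; the noisy oracle is a stand-in for the zeroth-order estimate.

```python
import numpy as np

# Recursive momentum at a FIXED iterate: the difference term in (4) is zero,
# so the update reduces to g_k = beta_k * sample + (1 - beta_k) * g_{k-1}.
rng = np.random.default_rng(1)
n, T = 5, 300
true_grad = np.arange(1.0, n + 1)            # stands in for the exact gradient

def noisy_sample():                          # stands in for one zeroth-order estimate
    return true_grad + rng.standard_normal(n)

g = noisy_sample()
plain_err, mom_err = [], []
for k in range(2, T + 1):
    beta = 2.0 / (k + 1)
    sample = noisy_sample()
    g = beta * sample + (1.0 - beta) * g     # momentum average of the samples
    plain_err.append(np.linalg.norm(sample - true_grad))
    mom_err.append(np.linalg.norm(g - true_grad))
```

With \(\beta _{k}=2/(k+1)\), the estimate averages past samples with linearly growing weights, so its error decays roughly like \(\mathcal{O}(1/\sqrt{k})\), while the error of a single sample stays constant.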

Remark 3

In Algorithm 1, the FW step ((6)–(7)) circumvents the projection operation by solving a linear optimization subproblem (6) over the constraint set \(\mathcal{X}\). For structured constraint sets such as nuclear and \(l_{1}\) norm balls, (6) admits an efficient implementation or even a closed-form solution [15], resulting in a cheaper computational cost than the projection step. For example, if \(\mathcal{X}\) is an \(l_{1}\) norm ball (\(\mathcal{X}:=\{x|\|x\|_{1} \leqslant d\}\)), the FW step admits the closed-form solution \(z^{i}_{k}=d\cdot [0,\ldots ,0,-\operatorname{sgn}[s^{i}_{k}]_{h},0,\ldots ,0]^{ \mathrm{T}}\) with \(h=\operatorname{argmax}_{j}|[s^{i}_{k}]_{j}|\) in Algorithm 1. Moreover, when \(\mathcal{X}\) is a nuclear norm ball, solving (6) requires computing only a single pair of singular vectors corresponding to the largest singular value, whereas projecting onto \(\mathcal{X}\) demands a full SVD.
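The closed-form \(l_{1}\)-ball oracle above can be sketched as follows (the helper name is ours); a brute-force check over the \(2n\) ball vertices confirms the chosen vertex is optimal:

```python
import numpy as np

def lmo_l1(s, d):
    """LM oracle over {x : ||x||_1 <= d}: the minimizer of <s, z> is the
    vertex -d*sgn([s]_h)*e_h with h = argmax_j |[s]_j|, as in Remark 3."""
    h = int(np.argmax(np.abs(s)))
    z = np.zeros_like(s)
    z[h] = -d * np.sign(s[h])
    return z

s = np.array([0.3, -2.0, 1.1, 0.0])
z = lmo_l1(s, d=5.0)     # puts +5 in coordinate 1, since |s_1| = 2 is largest
```

The oracle costs \(\mathcal{O}(n)\), in contrast to the \(\mathcal{O}(n\log n)\) sorting step typically needed for an exact \(l_{1}\)-ball projection.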

3 Assumptions and convergence analysis

This section is dedicated to analyzing the convergence performance of Algorithm 1. Before presenting the main results, we state several standard assumptions.

3.1 Assumptions and facts

Assumption 1

The network \(\mathcal{G}\) is connected.

Assumption 2

The weighted adjacency matrix W is doubly stochastic.

Assumptions 1 and 2 indicate that in each round of Step 1 in Algorithm 1, each agent takes a weighted average of the values of its neighbors according to W. In addition, these assumptions imply [26] that the second largest eigenvalue modulus λ of W satisfies \(|\lambda |<1\). The following fact holds under Assumptions 1 and 2 [26].

Fact 1

Let \(\bar{x}=\frac{1}{N} \sum_{i=1}^{N} x^{i}\) and \(\bar{x}^{i}=\sum_{j=1}^{N} w_{ij}x^{j}\). Then, \((\sum_{i=1}^{N} \|\bar{x}^{i} - \bar{x}\|^{2} )^{ \frac{1}{2}}\leqslant |\lambda | ( \sum_{i=1}^{N} \|x^{i} - \bar{x}\|^{2} )^{\frac{1}{2}}\).
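Fact 1 can be checked numerically. The snippet below (our own construction) builds a doubly stochastic W for a 5-agent ring, as used later in Sect. 4, and verifies the contraction on random iterates:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 5
W = np.zeros((N, N))                 # ring topology, maximum-degree weights
for i in range(N):
    W[i, i] = W[i, (i + 1) % N] = W[i, (i - 1) % N] = 1.0 / 3.0

# Second largest eigenvalue modulus (W is symmetric here, so eigvalsh applies).
lam = sorted(np.abs(np.linalg.eigvalsh(W)))[-2]

X = rng.standard_normal((N, 3))      # one 3-dimensional iterate per agent
xbar = X.mean(axis=0)
lhs = np.sqrt(np.sum(np.linalg.norm(W @ X - xbar, axis=1) ** 2))
rhs = lam * np.sqrt(np.sum(np.linalg.norm(X - xbar, axis=1) ** 2))
```

Since W is doubly stochastic, averaging preserves the mean \(\bar{x}\) while shrinking the deviation from it by a factor of at most \(|\lambda |\).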

Fact 1 suggests that each update of the average consensus process (Step 1) brings the iterates incrementally closer to their mean. To streamline the convergence analysis, we introduce \(k_{0}\in \mathbb{R}_{+}\) as the smallest integer such that \(|\lambda |\leqslant [k_{0}/(k_{0}+1)]^{2}\); explicitly, \(k_{0}=\lceil (|\lambda |^{-\frac{1}{2}}-1)^{-1}\rceil \).

Assumption 3

\(H_{i}(\cdot )\) and \(h_{i}(\cdot ,\xi ^{i})\) are L-smooth functions on the constraint set \(\mathcal{X}\) for all \(i\in \mathcal{N}\) and \(\xi ^{i}\in \mathbb{R}^{p}\).

Furthermore, we posit an additional assumption regarding the constraint set \(\mathcal{X}\), which forms a foundational element in the context of FW-based methods [3, 4, 14, 22].

Assumption 4

\(\mathcal{X}\) is compact and convex; in particular, its diameter is bounded by a positive constant d, i.e., \(\|x-y\|\leqslant d\) for all \(x,y\in \mathcal{X}\).

Assumption 5

The variance of \(\nabla h_{i}(x,\xi ^{i})\) is bounded for all \(x\in \mathcal{X}\) and \(i\in \mathcal{N}\). That is, there exists a constant δ such that \(\mathbb{E}[\|\nabla h_{i}(x,\xi ^{i})-\nabla H_{i}(x)\|^{2}] \leqslant \delta ^{2}\), where \(H_{i}(x)=\mathbb{E}[h_{i}(x,\xi ^{i})]\).

Fact 2

(see [13])

If Assumptions 4–5 hold, there is a positive constant l such that \(\mathbb{E}[\|\nabla h_{i}(x,\xi ^{i})\|^{2}]\leqslant l^{2}\) and \(\mathbb{E}[\|\nabla h_{i}(x, \xi ^{i})\|]\leqslant l\).

Assumptions 3–5 are standard assumptions in stochastic FW methods [3, 4, 9, 13, 14, 22]. If Assumption 3 holds, the following fact is true.

Fact 3

Define \(\hat{\nabla}H_{i}(x^{i}):=\sum^{n}_{j=1} \frac{H_{i}(x^{i}+\rho e_{j})-H_{i}(x^{i}-\rho e_{j})}{2\rho}e_{j}= \mathbb{E}[{\hat{\nabla}h_{i}(x^{i},\xi ^{i})}]\), where \(\hat{\nabla}h_{i}(x^{i},\xi ^{i})\) is defined in (2). Then, for any \(x^{i}\in \mathcal{X}\) (\(i\in \mathcal{N}\)) and \(\xi ^{i}\in \mathbb{R}^{p}\),

$$\begin{aligned} & \bigl\Vert \hat{\nabla}h_{i} \bigl(x^{i}, \xi ^{i} \bigr)-\nabla h_{i} \bigl(x^{i},\xi ^{i} \bigr) \bigr\Vert ^{2} \leqslant nL^{2}\rho ^{2}, \end{aligned}$$
(5)
$$\begin{aligned} & \bigl\Vert \hat{\nabla}H_{i} \bigl(x^{i} \bigr)-\nabla H_{i} \bigl(x^{i} \bigr) \bigr\Vert ^{2}\leqslant nL^{2} \rho ^{2}. \end{aligned}$$
(6)

Proof

We first prove (5). It follows from the definition of \(\hat{\nabla}h_{i}(x^{i},\xi ^{i})\) and the mean value theorem applied to \(h_{i}(\cdot ,\xi ^{i})\) along each coordinate direction that there exists \(\alpha _{j}\in (0,1)\) such that

$$\begin{aligned} & \bigl\Vert \hat{\nabla}h_{i} \bigl(x^{i},\xi ^{i} \bigr)-\nabla h_{i} \bigl(x^{i},\xi ^{i} \bigr) \bigr\Vert ^{2} \\ &\quad = \Biggl\Vert \sum_{j=1}^{n} \frac{h_{i}(x^{i}+\rho e_{j},\xi ^{i})-h_{i}(x^{i}-\rho e_{j},\xi ^{i})}{2\rho}e_{j} \\ &\qquad {}- \nabla h_{i} \bigl(x^{i}, \xi ^{i} \bigr) \Biggr\Vert ^{2} \\ &\quad = \Biggl\Vert \frac{1}{2\rho}\sum_{j=1}^{n} \bigl(2\rho e_{j}e^{\mathrm{T}}_{j} \nabla h_{i} \bigl(x^{i}+(2\alpha _{j}-1)\rho e_{j},\xi ^{i} \bigr) \bigr) \\ &\qquad {}-\nabla h_{i} \bigl(x^{i}, \xi ^{i} \bigr) \Biggr\Vert ^{2}. \end{aligned}$$

It follows from the property of the basis vector \(e_{j}\) and Euclidean norm that

$$\begin{aligned} & \bigl\Vert \hat{\nabla}h_{i} \bigl(x^{i},\xi ^{i} \bigr)-\nabla h_{i} \bigl(x^{i},\xi ^{i} \bigr) \bigr\Vert ^{2} \\ &\quad =\sum_{j=1}^{n} \bigl\Vert e_{j}e^{\mathrm{T}}_{j} \bigl(\nabla h_{i} \bigl(x^{i}+(2 \alpha _{j}-1)\rho e_{j},\xi ^{i} \bigr)-\nabla h_{i} \bigl(x^{i},\xi ^{i} \bigr) \bigr) \bigr\Vert ^{2} \\ &\quad \leqslant \sum_{j=1}^{n} \bigl\Vert \nabla h_{i} \bigl(x^{i}+(2\alpha _{j}-1) \rho e_{j},\xi ^{i} \bigr)-\nabla h_{i} \bigl(x^{i},\xi ^{i} \bigr) \bigr\Vert ^{2} \\ &\quad \leqslant L^{2}\sum_{j=1}^{n} \bigl\Vert (2\alpha _{j}-1)\rho e_{j} \bigr\Vert ^{2} \\ &\quad \leqslant nL^{2}\rho ^{2}, \end{aligned}$$

where the second inequality uses Assumption 3 (L-smoothness). Equation (6) follows in a similar way. □

Fact 4

(see [13])

For any vectors \(v_{1},\ldots , v_{N}\in \mathbb{R}^{n}\),

$$\begin{aligned} \Vert v_{1} + \cdots + v_{N} \Vert ^{2}\leqslant N \bigl( \Vert v_{1} \Vert ^{2} + \cdots + \Vert v_{N} \Vert ^{2} \bigr). \end{aligned}$$
(7)

Assumptions 1–5 and Facts 2–4 are crucial to the subsequent analysis: they provide the theoretical groundwork for the methodologies employed and the conclusions drawn.

3.2 Convergence analysis

For the convenience of analysis, we define

$$\begin{aligned}& \bar{x}_{k}:=\frac{1}{N}\sum_{i=1}^{N}x^{i}_{k},\qquad \bar{g}_{k}:= \frac{1}{N}\sum_{i=1}^{N}g^{i}_{k}, \\& \bar{p}_{k}:=\frac{1}{N}\sum_{i=1}^{N} \nabla H_{i} \bigl(\bar{x}^{i}_{k} \bigr). \end{aligned}$$

The following lemma estimates the tracking error for the average iterate in Algorithm 1, and we provide the proof in Appendix 1.2.

Lemma 1

Let \(\gamma _{k}=\frac{2}{k+2}\). If Assumptions 1, 2 and 4 hold, then, for any \(i\in \mathcal{N}\) and \(k\geqslant 1\), \(\|\bar{x}^{i}_{k}-\bar{x}_{k}\|\leqslant \frac{2C_{1}}{k+2}\) and \(\|\bar{x}^{i}_{k+1}-\bar{x}^{i}_{k}\|\leqslant \frac{2(d+2C_{1})}{k+2}\), where \(C_{1}\) is defined in Table 2.

Table 2 The nomenclature of values employed in this article

Lemma 1 shows that the averaged iterate estimate \(\bar{x}^{i}_{k}\) approximates the true average \(\bar{x}_{k}\) at a rate of \(\mathcal{O}(1/k)\).

We provide the performance of the averaged gradient tracking for Algorithm 1 in the following lemma. Appendix 1.4 presents the proof of Lemma 2.

Lemma 2

Suppose Assumptions 1–5 hold. If \(\beta _{k}=\frac{2}{k+1}\), \(\gamma _{k}=\frac{2}{k+2}\) and \(0<\rho _{k}\leqslant \frac{d+2C_{1}}{\sqrt{n}(k+2)}\), then

$$\begin{aligned} \mathbb{E} \bigl[ \bigl\Vert \bar{g}_{k}-s^{i}_{k} \bigr\Vert ^{2} \bigr]\leqslant \frac{4C_{2}}{(k+2)^{2}}, \end{aligned}$$
(8)

where \(C_{2}\) is defined in Table 2 and \(k\geqslant 1\).

Lemma 2 establishes that \(\mathbb{E}[\|\bar{g}_{k}-s^{i}_{k}\|^{2}]=\mathcal{O}(1/k^{2})\), which implies that \(\|\bar{g}_{k}-s^{i}_{k}\|\) converges to zero as \(k\rightarrow +\infty \) in expectation.

The following lemma plays an important role in the convergence analysis of Algorithm 1.

Lemma 3

Define \(\hat{\nabla}\bar{h}_{k}:=\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}_{k}[ \hat{\nabla}h_{i}(\bar{x}^{i}_{k},\xi ^{i}_{k})]\). If Assumptions 1–5 hold, the following two relations hold.

1) For any \(k\geqslant 1\), it holds that

$$\begin{aligned} &\mathbb{E} \bigl[ \Vert \bar{g}_{k}-\hat{\nabla}\bar{h}_{k} \Vert ^{2} \bigr] \\ &\quad \leqslant (1- \beta _{k})^{2}\mathbb{E} \bigl[ \Vert \bar{g}_{k-1}-\hat{\nabla}\bar{h}_{k-1} \Vert ^{2} \bigr]+60nL^{2} \rho ^{2}_{k-1} \\ &\qquad {} +6\delta ^{2}\beta _{k}^{2}+24L^{2} \gamma ^{2}_{k-1}(d+2C_{1})^{2}. \end{aligned}$$
(9)

2) If \(\beta _{k}=\frac{2}{k+1}\), \(\gamma _{k}=\frac{2}{k+2}\), and \(\rho _{k}\leqslant \frac{d+2C_{1}}{\sqrt{n}(k+2)}\), then for any \(k\geqslant 1\),

$$\begin{aligned} &\mathbb{E} \bigl[ \Vert \bar{g}_{k}-\bar{p}_{k} \Vert ^{2} \bigr]\leqslant \frac{2C_{3}+2L^{2}(d+2C_{1})^{2}}{k+2}, \end{aligned}$$
(10)

where \(C_{3}\) and \(C_{1}\) are defined in Table 2.

The proof of Lemma 3 is provided in Appendix 1.5.

Lemma 3 shows that the variable \(\bar{g}_{k}\) tracks the real average gradient \(\bar{p}_{k}\) with an expected squared error bounded by \(\mathcal{O}(\frac{C_{3}+L^{2}(d+2C_{1})^{2}}{k+2})\). That is, the expected error of the stochastic gradient approximation diminishes as the number of iterations increases. Making use of Lemmas 2 and 3, the following lemma is established.

Lemma 4

Choose \(\beta _{k}=\frac{2}{k+1}\), \(\gamma _{k}=\frac{2}{k+2}\), and \(0<\rho _{k}\leqslant \frac{d+2C_{1}}{\sqrt{n}(k+2)}\). If Assumptions 1–5 hold, then, for any \(k\geqslant 1\) and \(i\in \mathcal{N}\),

$$\begin{aligned} &\mathbb{E} \bigl[ \bigl\Vert \nabla h(\bar{x}_{k})-s^{i}_{k} \bigr\Vert ^{2} \bigr] \\ &\quad \leqslant \frac{18L^{2}(d+2C_{1})^{2}+12C_{2}+6C_{3}}{k+2}. \end{aligned}$$
(11)

The proof is presented in Appendix 1.6.

The following two theorems establish convergence rates of Algorithm 1 for convex and nonconvex objectives, respectively.

Theorem 1

(Convex objective) Let Assumptions 1–5 hold. Choose \(\beta _{k}=\frac{2}{k+1}\), \(\gamma _{k}=\frac{2}{k+2}\), and \(0<\rho _{k}\leqslant \frac{d+2C_{1}}{\sqrt{n}(k+2)}\). If \(h_{i}(\cdot ,\xi ^{i})\) is convex for any \(i\in \mathcal{N}\) and \(\xi ^{i}\), then

$$\begin{aligned} \mathbb{E} \bigl[h(\bar{x}_{k+1}) \bigr]-h \bigl(x^{*} \bigr) \leqslant \frac{C_{4}}{(k+3)^{\frac{1}{2}}},\quad \forall k\geqslant 1, \end{aligned}$$

where \(C_{4}\) is defined in Table 2.

The proof of Theorem 1 is presented in Appendix 1.7.

Theorem 1 indicates that Algorithm 1 converges at a rate of \(\mathcal{O}(1/k^{\frac{1}{2}})\). The result translates directly into finding an ϵ-optimal solution to problem (1): the numbers of calls to the SZO and the LMO required are \(\mathcal{O}(\frac{nC^{2}_{4}}{\epsilon ^{2}})\) and \(\mathcal{O}(\frac{C^{2}_{4}}{\epsilon ^{2}})\), respectively.

For the nonconvex case, we adopt a convergence criterion standard for FW methods, aka the FW-gap [4, 13, 14, 22], which is

$$\begin{aligned} p_{k}=\max_{x\in \mathcal{X}} \bigl\langle \nabla h( \bar{x}_{k}),\bar{x}_{k}-x \bigr\rangle . \end{aligned}$$
(12)
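For structured sets, the inner maximization in (12) is itself an LM problem and can often be evaluated in closed form; e.g., over the \(l_{1}\) ball \(\{x:\|x\|_{1}\leqslant d\}\) one gets \(p_{k}=\langle \nabla h(\bar{x}_{k}),\bar{x}_{k}\rangle +d\|\nabla h(\bar{x}_{k})\|_{\infty}\). A short sketch (the helper name is ours):

```python
import numpy as np

def fw_gap_l1(grad, xbar, d):
    """FW-gap (12) over the l1 ball {x : ||x||_1 <= d}:
    max_x <grad, xbar - x> = <grad, xbar> + d * ||grad||_inf."""
    return float(grad @ xbar + d * np.max(np.abs(grad)))

grad = np.array([0.5, -1.5, 0.2])
xbar = np.array([1.0, 0.0, -0.5])
gap = fw_gap_l1(grad, xbar, d=5.0)
```

The closed form follows because a linear function over the \(l_{1}\) ball attains its extremum at a vertex \(\pm d\,e_{j}\).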

Based on the convergence measure (12), we establish the following theorem for problem (1) with nonconvex objective functions.

Theorem 2

(Nonconvex objective) Suppose Assumptions 1–5 hold. Choose \(\beta _{k}=\frac{2}{k+1}\), \(\gamma _{k}=\frac{2}{k+2}\), and \(0<\rho _{k}\leqslant \frac{d+2C_{1}}{\sqrt{n}(k+2)}\). Then,

$$\begin{aligned} \mathbb{E} \Bigl[\min_{k\in \{1,\ldots ,K\}}p_{k} \Bigr]\leqslant{}& \frac{1}{\log _{2}(K)-1} \bigl(h(\bar{x}_{1})-h( \bar{x}_{K+1})+4Ld \\ &{} +c\sqrt{18L^{2}(d+2C_{1})^{2}+12C_{2}+6C_{3}} \bigr), \end{aligned}$$

where \(c\in \mathbb{R}_{+}\) satisfies \(\sum_{k=1}^{2^{m}}(4d/(k+2)^{\frac{3}{2}})\leqslant c\) with \(K=2^{m}\); such a finite c exists because the series converges.

The proof of Theorem 2 is presented in Appendix 1.8.

Theorem 2 shows that Algorithm 1 converges to a stationary point at a rate of \(\mathcal{O}(1/\log _{2}(K))\) when the objective function is nonconvex. The total numbers of calls to the SZO and the LMO for finding an ϵ-stationary point are \(\mathcal{O}(2^{\frac{\Gamma}{\epsilon}}d)\) and \(\mathcal{O}(2^{\frac{\Gamma}{\epsilon}})\), respectively.

Remark 4

Table 1 shows that both the number of calls to the SZO and the per-query batch size of Algorithm 1 are significantly smaller than those of ZSCG and ZSAGMIU [4], at the cost of a larger complexity bound on the LMO. In addition, Algorithm 1 has the same complexity bounds for both the SZO and the LMO as the recently proposed centralized method MOST-FW [3]. Compared with the existing decentralized zeroth-order FW method DSGFF [22], which requires a central coordinator, the fully distributed Algorithm 1 has a lower SZO complexity bound in the convex case and a weaker dimension dependence of the SZO in the nonconvex case.

Remark 5

It is worth noting that our step sizes are monotonically decreasing, unlike those in existing zeroth-order nonconvex FW methods [4, 14, 22], which depend on the total number of iterations K and the dimension of the variable.

4 Numerical simulations

In this section, we apply Algorithm 1 (DSZO-FW) to black-box distributed stochastic binary classification problems with convex and nonconvex objectives, respectively. DSZO-FW runs over a connected network \(\mathcal{G}\) with \(N=5\) agents and a doubly stochastic adjacency matrix W. The communication graph is a ring topology, and each agent accesses only its own objective function \(h_{i}\). We construct W using maximum-degree weights: the maximum degree of the ring is \(d_{\max}=2\), so for any edge \((i,j)\) with \(i\neq j\) the weight is \(w_{ij}=1/(1+d_{\max})\), and the diagonal elements are set to \(w_{ii}=1-\sum_{j\in \mathcal{N}_{i}}w_{ij}\), where \(\mathcal{N}_{i}\) denotes the set of neighbors of node i, so that each row sums to 1. We set the constraint set to the \(l_{1}\)-norm ball \(\mathcal{X}=\{x|\|x\|_{1}\leqslant d\}\) with \(d=5\).
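The weight construction described above can be sketched as follows (the helper name is ours):

```python
import numpy as np

def max_degree_weights(adj):
    """Maximum-degree weights from a 0/1 adjacency matrix with no self-loops:
    w_ij = 1/(1 + d_max) on edges; diagonal entries fill each row to sum 1."""
    N = adj.shape[0]
    d_max = int(adj.sum(axis=1).max())
    W = adj / (1.0 + d_max)
    W[np.arange(N), np.arange(N)] = 1.0 - W.sum(axis=1)
    return W

# Ring of N = 5 agents as in the experiments (d_max = 2, edge weights 1/3).
N = 5
adj = np.zeros((N, N))
for i in range(N):
    adj[i, (i + 1) % N] = adj[i, (i - 1) % N] = 1.0
W = max_degree_weights(adj)
```

For an undirected graph with uniform edge weights, the resulting W is symmetric, hence doubly stochastic as required by Assumption 2.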

To better evaluate the performance of DSZO-FW, we compare it against the centralized algorithms ZSCG [4], SGFFW [23], and MOST-FW [3] as baselines. In the experiments, we use three public datasets (covtype.binary, a9a and w8a) and let each iteration randomly sample only 1% of the data. Because the large batch size \(m_{k}\) required by ZSCG (related to the dimension and the total number of iterations) exceeds the total number of samples in these three datasets, we run ZSCG as a deterministic algorithm that uses the full data to compute function values. We evaluate the four algorithms by the FW-gap defined in (12).

4.1 Black-box binary classification with convex objectives

This subsection verifies the theoretical results of DSZO-FW in the convex case. The goal is to find an optimal solution \(x\in \mathbb{R}^{n}\) of the following stochastic binary classification problem:

$$\begin{aligned} &\min_{x\in \mathcal{X}}h(x),\quad h(x):= \frac{1}{N}\sum _{i=1}^{N}h_{i}(x), \\ &h_{i}(x):=\frac{1}{m_{i}}\sum_{j=1}^{m_{i}} \mathbb{E}_{a_{ij},b_{ij}} \bigl[\mathrm{ln} \bigl(1+\operatorname{exp} \bigl(-b_{ij}\langle a_{ij},x\rangle \bigr) \bigr) \bigr], \end{aligned}$$

where \((a_{ij},b_{ij})^{m_{i}}_{j=1}\) are the \(m_{i}\) (feature, label) pairs randomly drawn by agent i from the dataset. For a fair comparison, we set the step sizes of the four algorithms to the values prescribed by their theoretical results in the convex setting, i.e., \(\alpha _{k}=6/(k+5)\) for ZSCG [4]; \(\rho _{k}=4/(k+8)^{\frac{2}{3}}\), \(\gamma _{k}=2/(k+8)\) and \(c_{k}=2/(n^{\frac{1}{2}}(k+8)^{\frac{1}{3}})\) for SGFFW [23]; \(\gamma _{k}=1/k\), \(\eta _{k}=2/(k+1)\), \(\mu _{k}=0\) and \(\rho _{k}=d/\sqrt{n}(k+1)\) for MOST-FW [3]; \(\beta _{k}=2/(k+1)\), \(\gamma _{k}=2/(k+2)\) and \(\rho _{k}=d/\sqrt{n}(k+2)\) for DSZO-FW.

Figure 1 shows the convergence performance of the four algorithms on the convex binary classification problem. DSZO-FW and MOST-FW achieve a smaller FW-gap than ZSCG and SGFFW, especially on dataset \(w8a\), even though they use less data than ZSCG. This indicates that the local gradient estimate based on the recursive momentum technique may be a better candidate for approximating the gradient. We also observe periodic oscillations in the curves of all four algorithms, especially on datasets \(a9a\) and \(w8a\). We attribute this phenomenon to the imprecision of the gradient estimator and the periodic nature of the variance reduction.

Figure 1
figure 1

The comparison between ZSCG, MOST-FW, SGFFW and DSZO-FW on convex black-box binary classification. (a) a9a (b) w8a (c) covtype.binary

4.2 Black-box binary classification with nonconvex objectives

In this subsection, we verify the theoretical results of DSZO-FW in the nonconvex case. Consider the following stochastic binary classification problem with nonconvex objective functions:

$$\begin{aligned} &\min_{x\in \mathcal{X}}h(x),\quad h(x):= \frac{1}{N}\sum _{i=1}^{N}h_{i}(x), \\ &h_{i}(x):=\frac{1}{m_{i}}\sum_{j=1}^{m_{i}} \mathbb{E}_{a_{ij},b_{ij}} \biggl[\frac{1}{1+\operatorname{exp}(b_{ij}\langle a_{ij},x\rangle )} \biggr], \end{aligned}$$

where \((a_{ij},b_{ij})^{m_{i}}_{j=1}\) are the \(m_{i}\) (feature, label) pairs randomly drawn by agent i from the dataset. For a fair comparison, we set the step sizes of the four algorithms to the values prescribed by their theoretical results in the nonconvex setting, i.e., \(\alpha _{k}=1/T^{\frac{1}{2}}\) for ZSCG [4]; \(\gamma _{k}=1/T^{\frac{3}{4}}\), \(\rho _{k}=4/((k+8)^{\frac{2}{3}}(1+n)^{\frac{1}{3}})\), and \(c_{k}=2/(n^{\frac{3}{2}}(k+8)^{\frac{1}{3}})\) for SGFFW [23]; \(\gamma _{k}=1/k\), \(\eta _{k}=2/(k+1)\), \(\mu _{k}=0\), and \(\rho _{k}=d/\sqrt{n}(k+1)\) for MOST-FW [3]; \(\beta _{k}=2/(k+1)\), \(\rho _{k}=d/\sqrt{n}(k+2)\), and \(\gamma _{k}=2/(k+2)\) for DSZO-FW. Note that MOST-FW has no convergence guarantee in the nonconvex case; we implement it only for comparison.

Figure 2 shows the convergence performance, measured by the FW-gap, of the four algorithms on the nonconvex binary classification problem. The results show that DSZO-FW converges faster than ZSCG and SGFFW on all three datasets, and has convergence performance comparable to MOST-FW on datasets \(a9a\) and \(w8a\), demonstrating the efficacy of the variance reduction technique used in DSZO-FW and MOST-FW. As in Fig. 1, periodic oscillations appear in the curves of all four algorithms, especially on datasets \(a9a\) and \(w8a\); we infer that the variance of the gradient estimator is too high in these two cases.

Figure 2
figure 2

The comparison between ZSCG, MOST-FW, SGFFW and DSZO-FW on nonconvex black-box binary classification. (a) a9a (b) w8a (c) covtype.binary

5 Conclusions

This paper proposed a novel projection-free and gradient-free algorithm for distributed stochastic optimization problems with access only to a stochastic zeroth-order oracle (SZO). The proposed algorithm requires only a single batch per iteration to guarantee convergence, thanks to the recursive momentum and gradient tracking techniques. We proved that the proposed algorithm attains a complexity bound of \(\mathcal{O}(n/\epsilon ^{2})\) on the SZO for the convex case, comparable to the best centralized results. For the nonconvex case, the algorithm has a complexity bound of \(\mathcal{O}(n\cdot 2^{\frac{1}{\epsilon}})\) on the SZO under mild conditions. The efficacy of the proposed algorithm was demonstrated through simulation experiments on multiple datasets. Future work includes extending the algorithm to stochastic nonsmooth optimization problems and introducing further variance reduction techniques to obtain better convergence performance.