1 Introduction

In this paper, we study the following class of problems:

$$\begin{aligned} \min _{x\in \mathbb {R}^n} \, \phi (x) := f(x)+r(x), \end{aligned}$$
(1)

where \(f(\cdot ):{\mathbb {R}}^n\rightarrow {\mathbb {R}}\) is a stochastic, weakly convex, and potentially nonsmooth (i.e., not necessarily continuously differentiable) function, and \(r(\cdot ):{\mathbb {R}}^n\rightarrow \bar{{\mathbb {R}}}\) is an extended real valued convex function that is not necessarily continuous, but satisfies some additional conditions detailed below. Furthermore, we consider the derivative free, or zeroth order, setting, wherein neither the subgradients \(\partial f\) nor unbiased estimates thereof are available, and only unbiased estimates of the function values f(x) can be obtained. We thus write

$$\begin{aligned} f(x)=\mathbb {E}_\xi [F(x;\xi )]=\int _{\Xi } F(x,\xi )dP(\xi ), \end{aligned}$$

where \(\{F(\cdot , \xi ), \ \xi \in \Xi \}\) is a collection of real valued functions and P is a probability distribution over the set \(\Xi \).

We state two quantitative assumptions on \(f(\cdot )\) and \(r(\cdot )\) below. First, we recall the notion of a proximal map: for any constant \(\alpha >0\) and any convex function h, we write \(\text{ prox}_{\alpha h}\) to indicate the following function:

$$\begin{aligned} \text{ prox}_{\alpha h}(x)=\displaystyle {{\,\mathrm{argmin}\,}}_y \{h(y)+\frac{1}{2\alpha }\Vert y-x\Vert ^2\}. \end{aligned}$$

The associated optimality condition is

$$\begin{aligned} y = \text{ prox}_{\alpha h}(x) \Longleftrightarrow x-y \in \alpha \partial h (y). \end{aligned}$$

We shall make use of the nonexpansiveness property of the proximal mapping in the sequel,

$$\begin{aligned} \Vert \text{ prox}_{\alpha h}(x)- \text{ prox}_{\alpha h}(y)\Vert \le \Vert x-y\Vert . \end{aligned}$$
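
As a concrete illustration (our example, not taken from the text above), the proximal map of the \(\ell _1\) norm reduces to componentwise soft thresholding; a minimal Python sketch follows, with the nonexpansiveness property checked numerically on random points.

```python
import numpy as np

def prox_l1(x, alpha):
    """Proximal map of h(y) = ||y||_1 with parameter alpha: soft thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - alpha, 0.0)

# Numerical check of nonexpansiveness: ||prox(x) - prox(y)|| <= ||x - y||.
rng = np.random.default_rng(0)
x, y = rng.normal(size=5), rng.normal(size=5)
assert np.linalg.norm(prox_l1(x, 0.3) - prox_l1(y, 0.3)) <= np.linalg.norm(x - y)
```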

We now state our standing assumption on the properties of (1):

Assumption 1

  1.

    \(f(\cdot )\) is \(\rho \)-weakly convex, i.e., \(f(x)+\rho \Vert x\Vert ^2\) is convex for some \(\rho >0\), directionally differentiable, bounded below by \(f_\star \) and locally Lipschitz with constant \(L_0\).

  2.

    \(r(\cdot )\) is convex (but not necessarily continuously differentiable). Furthermore, r(x) is bounded below by \(r_\star \).

We shall denote the lower bound of \(\phi \) by \(\phi _\star =f_\star +r_\star \).

We further assume that the proximal map of r(x) can be evaluated at low computational cost. We note that \(\rho \)-weak convexity of a function f is equivalent to hypomonotonicity of its subdifferential map, that is,

$$\begin{aligned} \langle v-w,x-y\rangle \ge -\rho \Vert x-y\Vert ^2 \end{aligned}$$
(2)

for \(v\in \partial f(x)\) and \(w\in \partial f(y)\) (see e.g., [1, Example 12.28, p 549]).

The class of weakly convex functions is a special yet very common class of nonconvex functions, which contains all convex (possibly nonsmooth) functions and all Lipschitz smooth functions. One standard subclass of weakly convex functions is given by composite functions \(f(x)=h(c(x))\), where h is nonsmooth and convex and c(x) is continuously differentiable but nonconvex (see e.g., [2] and references therein). The additive composite class is another widely used class of weakly convex functions [3], formed from all sums \(g(x)+l(x)\) with l closed and convex and g continuously differentiable.
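
As a quick illustration (our example), consider the loss \(f(x)=|\langle a,x\rangle ^2-b|\), a single term of the phase retrieval objective used later in Sect. 4. It has exactly the composite form h(c(x)) with h the absolute value and c a smooth quadratic, and the hypomonotonicity property (2) can be checked numerically; one valid modulus is \(\rho =2\Vert a\Vert ^2\), the Lipschitz constant of \(\nabla c\). The sketch below assumes this specific loss and modulus.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=5), 0.7
rho = 2 * np.linalg.norm(a) ** 2       # a valid weak convexity modulus for this loss

def subgrad(x):
    """A subgradient of f(x) = |<a,x>^2 - b| (composite form h(c(x)) with h = |.|)."""
    inner = a @ x
    return np.sign(inner ** 2 - b) * 2 * inner * a

# Numerical check of the hypomonotonicity property (2) on random pairs of points.
for _ in range(1000):
    x, y = rng.normal(size=5), rng.normal(size=5)
    v, w = subgrad(x), subgrad(y)
    assert (v - w) @ (x - y) >= -rho * np.linalg.norm(x - y) ** 2 - 1e-9
```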

One method for solving a stochastic weakly convex optimization problem is given by repeated iterations of,

$$\begin{aligned} x_{k+1} \in \text {argmin}_y \left\{ f_{x_k}(y;S_k)+r(y)+\frac{1}{2\alpha _k}\Vert y-x_k\Vert ^2\right\} \end{aligned}$$
(3)

where \(\alpha _k>0\) is a stepsize sequence, typically taken to satisfy \(\alpha _k\rightarrow 0\), and \(f_{x_k}(y;S_k)\) is a model approximating f around \(x_k\) built from a noisy sample \(S_k\) of the data. A basic stochastic subgradient method uses the linear model

$$\begin{aligned} f_{x_k}(y;S_k) =f(x_k) +\zeta ^T (y-x_k) \end{aligned}$$

where \(\zeta \approx \bar{\zeta }\in \partial f(x_k)\). When using this approach, it is common to assume the existence of an oracle returning an unbiased estimate of an element of the subgradient, which enables one to build the approximation \(f_{x_k}\) with favorable properties (see e.g., [2] or [4]). In our case we assume such an oracle is not available, and we only get access, at a point x, to a noisy function value observation \(F(x,\xi )\). Stochastic problems with only functional information available often arise in optimization, machine learning and statistics. A classic example is simulation based optimization (see e.g., [5, 6] and references therein), where function evaluations usually represent the experimentally obtained behavior of a system and in practice are produced by specific simulation tools, hence no internal or analytical knowledge of the functions is available. Furthermore, evaluating the function at a given point is in many cases a computationally expensive task, and only a limited budget of evaluations is available. Recently, suitable derivative free/zeroth order optimization methods have been proposed for handling stochastic functions (see e.g., [7,8,9,10]). For a complete overview of stochastic derivative free/zeroth order methods, we refer the interested reader to the recent review [6].
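
Coming back to the update (3): substituting the linear model above, dropping the constant \(f(x_k)\), and completing the square shows that the basic method is just a proximal stochastic subgradient step,

$$\begin{aligned} x_{k+1} = {{\,\mathrm{argmin}\,}}_y \left\{ r(y)+\frac{1}{2\alpha _k}\left\| y-(x_k-\alpha _k \zeta )\right\| ^2\right\} = \text{ prox}_{\alpha _k r}(x_k-\alpha _k \zeta ). \end{aligned}$$

The zeroth order method developed below takes steps of exactly this form, with \(\zeta \) replaced by a two point, function value based estimate.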

Weakly convex functions show up in the modeling of many different statistical learning applications like, e.g., (robust) phase retrieval, sparse dictionary learning, conditional value at risk (see [2] for a complete description of those problems). Other interesting applications include the training of neural networks with Exponentiated Linear Units (ELUs) activation functions [11] and machine learning problems with L-smooth loss functions (see e.g., [12] and references therein).

In all these problems there might be cases where we only get access, at a point x, to an unbiased estimate of the loss function \(F(x,\xi )\), and we thus need to resort to a stochastic derivative free/zeroth order approach. Recall that a standard setting is one in which a function evaluation is the noisy output of some complex simulation. Such a problem can appear, for instance, in an inverse problem where we use a robust nonsmooth loss to match parameters to a nonconvex simulation, i.e., \(F(x,\xi )=\sum _i\Vert G(x,\xi _i)-o_i\Vert _1\), where \(\{o_i\}\) is the set of observations and \(\{\xi _i\}\) a set of samples of the simulation run, which has the form of the composite case h(c(x)) described above; or in the case of a simulation function that is convex but for which we are interested in, e.g., minimizing its conditional value at risk.

At the time of writing, zeroth order, or derivative free, optimization for weakly convex problems has not been investigated. There are a number of works for stochastic nonconvex zeroth order optimization (e.g., [13]) and nonsmooth convex derivative free optimization (e.g., [9]).

In the case of stochastic weakly convex optimization with access to a noisy element of the subgradient, a few works have appeared fairly recently. Asymptotic convergence was shown in [4], which proves convergence with probability one for the method given in (3). Non-asymptotic convergence, i.e., convergence rates in expectation, is given in the two papers [2] and [14].

In this paper, we follow the approach proposed in [9] to handle nonsmoothness in our problem. We consider a smoothed version of the objective function, and we then apply a two point strategy to estimate its gradient. This tool is thus embedded in a proximal algorithm similar to the one described in [2] and enables us to get convergence at a similar rate as the original method (although with larger constants).

The rest of the paper is organized as follows. In Sect. 2 we describe the algorithm and provide some preliminary lemmas needed for the subsequent analysis. Section 3 contains the convergence proof. In Sect. 4 we show some numerical results on two standard test cases. Finally, we conclude in Sect. 5.

2 Two point estimate and algorithmic scheme

We use the two point estimate presented in [9] to generate an approximation to an element of the subdifferential. In particular, consider the randomized smoothing of the function f,

$$\begin{aligned} f_{u}(x) =\mathbb {E}[f(x+uz)] = \int f(x+uz) dZ \end{aligned}$$

where Z denotes the distribution of a standard normal vector, i.e., we take the expectation over \(z\sim \mathcal {N}(0,I_n)\).

The two point estimate we use is given by considering a second smoothing, now of \(f_{u_{1,t}}\) for a given \(u_{1,t}\) indexed by iteration t, i.e.,

$$\begin{aligned} f_{u_{1,t},u_{2,t}}(x) =\mathbb {E}[f_{u_{1,t}}(x+u_{2,t}z)] = \int f_{u_{1,t}}(x+u_{2,t}z) dZ. \end{aligned}$$

To derive the specific step computed, let us consider the derivative of this function with respect to x. We first write,

$$\begin{aligned} \begin{array}{ll} f_{u_{1,t}u_{2,t}}(x) &{} = \int f_{u_{1,t}}(x+u_{2,t}z) dZ = \frac{1}{\kappa } \int f_{u_{1,t}}(x+u_{2,t} v) e^{-\frac{\Vert v\Vert ^2}{2}} dv \\ &{} = \frac{1}{\kappa u_{2,t}^n} \int f_{u_{1,t}}(y) e^{-\frac{\Vert y-x\Vert ^2}{2u_{2,t}^2}} dy, \end{array} \end{aligned}$$

where

$$\begin{aligned} \kappa := \int e^{-\frac{\Vert v\Vert ^2}{2}} dv = (2\pi )^{n/2} \end{aligned}$$

and we used the change of variables \(y=x+u_{2,t} v\). Now we write,

$$\begin{aligned} \begin{array}{ll} \nabla f_{u_{1,t},u_{2,t}}(x) &{} = \frac{1}{\kappa u_{2,t}^{n+2}} \int f_{u_{1,t}}(y) e^{-\frac{\Vert y-x\Vert ^2}{2u_{2,t}^2}} (y-x) dy \\ &{} = \frac{1}{\kappa u_{2,t}} \int f_{u_{1,t}}(x+u_{2,t} v) e^{-\frac{\Vert v\Vert ^2}{2}} v dv \\ &{} = \frac{1}{\kappa } \int \frac{f_{u_{1,t}}(x+u_{2,t} v)-f(x)}{u_{2,t}} e^{-\frac{\Vert v\Vert ^2}{2}} v dv \\ &{} = \int \frac{f_{u_{1,t}}(x+u_{2,t} z)-f(x)}{u_{2,t}} z dZ. \end{array} \end{aligned}$$
(4)

where the third equality comes from the fact that the function \(ve^{-\frac{\Vert v\Vert ^2}{2}}\) is odd, so the term involving the constant f(x) integrates to zero.

Now let \(\{u_{1,t}\}_{t=1}^\infty \), \(\{u_{2,t}\}_{t=1}^\infty \) be two nonincreasing sequences of positive parameters such that \(u_{2,t}\le u_{1,t}/2\), let \(x_t\) be the current point, let \(\xi _t\) be a sample drawn from \(\Xi \) according to P, and let \(Z_{1,t}\) and \(Z_{2,t}\) be two vectors independently sampled from \(\mathcal {N}(0,I_n)\). From the derivation above, we can see that the quantity,

$$\begin{aligned} {\begin{matrix} g_t(x) &{}=G(x,u_{1,t},u_{2,t},Z_{1,t},Z_{2,t},\xi _t) \\ &{} = \frac{F(x+u_{1,t} Z_{1,t}+u_{2,t} Z_{2,t};\xi _t)-F(x+u_{1,t} Z_{1,t};\xi _t)}{u_{2,t}} Z_{2,t}, \end{matrix}} \end{aligned}$$
(5)

is an unbiased estimator of \(\nabla f_{u_{1,t},u_{2,t}}(x)\). Thus, effectively, the first random perturbation \(u_{1,t} Z_{1,t}\) smooths out the nonsmooth function F, and the second, \(u_{2,t} Z_{2,t}\), obtains a zeroth order estimate of the gradient of the smoothed function using noisy function evaluations. We use \(g_t(x)\) in our algorithm at each iteration. We highlight the importance of using an adequate random number generator to draw \(Z_{1,t}\), \(Z_{2,t}\) and the stochastic function realization \(\xi _t\) at every iteration: the samples \(\xi _t\) and \(Z_{1,t}\) are the same in \(F(x+u_{1,t} Z_{1,t}+u_{2,t} Z_{2,t};\xi _t)\) and \(F(x+u_{1,t} Z_{1,t};\xi _t)\), making the two point estimator essentially a common random numbers device.
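
To make the construction concrete, the following Python sketch (the function name and the interface of the oracle, which takes a point and a realization of \(\xi \), are our own illustrative choices) implements the estimator (5) with common random numbers. As a sanity check, for a deterministic linear function \(f(x)=c^Tx\) the estimator has expectation exactly c, so a Monte Carlo average should be close to c.

```python
import numpy as np

def two_point_estimate(F, x, u1, u2, xi, rng):
    """Two point estimate (5) of the gradient of the doubly smoothed function.

    The same realization xi and the same direction Z1 are used in both function
    evaluations (common random numbers); Z1 smooths f, Z2 probes the gradient."""
    Z1 = rng.normal(size=x.size)
    Z2 = rng.normal(size=x.size)
    base = x + u1 * Z1
    return (F(base + u2 * Z2, xi) - F(base, xi)) / u2 * Z2

# Sanity check on f(x) = c^T x: the estimator is an unbiased estimate of c.
rng = np.random.default_rng(0)
c = np.array([1.0, -2.0, 0.5])
F = lambda x, xi: c @ x
g_avg = np.mean([two_point_estimate(F, np.zeros(3), 1e-2, 1e-3, None, rng)
                 for _ in range(20000)], axis=0)
print(np.linalg.norm(g_avg - c))   # small (Monte Carlo error only)
```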

We now report some results that provide theoretical guarantees on the error in the estimate. These results appear in [15], however we include some of their (short) proofs for completeness.

Lemma 1

[15, Lemma 1] It holds that,

$$\begin{aligned} \frac{1}{\kappa } \int \Vert v\Vert ^p e^{-\frac{\Vert v\Vert ^2}{2}}dv\le n^{p/2}, \end{aligned}$$
(6)

with \(p\in [0,2]\), and

$$\begin{aligned} n^{p/2}\le \frac{1}{\kappa } \int \Vert v\Vert ^p e^{-\frac{\Vert v\Vert ^2}{2}}dv\le (p+n)^{p/2}, \end{aligned}$$
(7)

with \(p\ge 2\).

Lemma 2

[15, Theorem 1] It holds that,

$$\begin{aligned} \left| f_{u_{1,t}}(x)-f(x)\right| \le u_{1,t} L_0 \sqrt{n}, \end{aligned}$$
(8)

where \(L_0\) is the Lipschitz constant of f.

Proof

Indeed,

$$\begin{aligned} \begin{array}{ll} \left| f_{u_{1,t}}(x)-f(x)\right| &{} \le \frac{1}{\kappa } \int \left| f(x+u_{1,t}v)-f(x)\right| e^{-\frac{\Vert v\Vert ^2}{2}}dv\le \frac{u_{1,t}L_0}{\kappa }\int \left\| v\right\| e^{-\frac{\Vert v\Vert ^2}{2}}dv \\ &{} \le u_{1,t} L_0 \sqrt{n}, \end{array} \end{aligned}$$

where we have used the Lipschitz constant \(L_0\) for f as given in Assumption 1 and the last inequality follows from Eq. (6) in Lemma 1.

Lemma 3

[15, Lemma 2] The function \(f_{u_{1,t}}\) is Lipschitz continuously differentiable with constant \(\frac{L_0\sqrt{n}}{u_{1,t}}\).

Proof

$$\begin{aligned} \begin{array}{ll} \left\| \nabla f_{u_{1,t}}(x)-\nabla f_{u_{1,t}}(y)\right\| &{} \le \frac{1}{u_{1,t}\kappa } \int \left| f(x+u_{1,t} v)-f(y+u_{1,t} v)\right| e^{-\frac{\Vert v\Vert ^2}{2}} \Vert v\Vert dv \\ &{} \le \frac{L_0}{u_{1,t}\kappa } \Vert x-y\Vert \int e^{-\frac{\Vert v\Vert ^2}{2}} \Vert v\Vert dv \le \frac{L_0\sqrt{n}}{u_{1,t}} \Vert x-y\Vert . \end{array} \end{aligned}$$

The condition proved in Lemma 3 implies the following inequality (see e.g., [15]):

$$\begin{aligned} \left| f_{u_{1,t}} (y)- f_{u_{1,t}} (x)-\langle \nabla f_{u_{1,t}} (x),(y-x)\rangle \right| \le \frac{L_0\sqrt{n}}{ u_{1,t}}\Vert x-y\Vert ^2. \end{aligned}$$
(9)

Lemma 4

[15, Lemma 3] It holds that

$$\begin{aligned} \left\| \nabla f_{u_{1,t},u_{2,t}} (x) - \nabla f_{u_{1,t}} (x)\right\| \le \frac{u_{2,t}L_0\sqrt{n}(n+3)^{3/2}}{2 u_{1,t}} \le \frac{u_{2,t}}{u_{1,t}}\bar{\sigma }, \end{aligned}$$
(10)

with \(\bar{\sigma }=\frac{L_0\sqrt{n}(n+3)^{3/2}}{2}\).

Proof

First, note that

$$\begin{aligned} \nabla f_{u_{1,t}} (x) = \frac{1}{\kappa } \int \langle \nabla f_{u_{1,t}} (x),v\rangle e^{-\frac{\Vert v\Vert ^2}{2}} v dv. \end{aligned}$$

And so,

$$\begin{aligned}&\left\| \nabla f_{u_{1,t},u_{2,t}} (x) - \nabla f_{u_{1,t}} (x)\right\| \\&= \left\| \frac{1}{\kappa }\int \left( \frac{f_{u_{1,t}} (x+u_{2,t}v)- f_{u_{1,t}} (x)}{u_{2,t}}- \langle \nabla f_{u_{1,t}} (x),v\rangle \right) v e^{-\frac{\Vert v\Vert ^2}{2}} dv \right\| \\&\le \frac{1}{\kappa u_{2,t}} \int \left| f_{u_{1,t}} (x+u_{2,t}v)- f_{u_{1,t}} (x)-u_{2,t} \langle \nabla f_{u_{1,t}} (x),v\rangle \right| \Vert v\Vert e^{-\frac{\Vert v\Vert ^2}{2}} dv \\&\le \frac{u_{2,t}L_0\sqrt{n}}{2\kappa u_{1,t}} \int \Vert v\Vert ^3 e^{\frac{-\Vert v\Vert ^2}{2}}dv \le \frac{u_{2,t}L_0\sqrt{n}(n+3)^{3/2}}{2 u_{1,t}}, \end{aligned}$$

where the first inequality uses the triangle inequality for integrals, the second inequality uses equation (9) coming from Lemma 3, and the last inequality uses equation (7) in Lemma 1.

We further report one more useful preliminary result.

Lemma 5

The following inequality holds:

$$\begin{aligned} \langle \nabla f_{u}(x)-\nabla f_{u}(y),x-y\rangle \ge -\rho \Vert x-y\Vert ^2-4L_0 u \Vert x-y\Vert . \end{aligned}$$

Proof

By using the definition of \(f_{u}(x)\), we have

$$\begin{aligned} \begin{array}{l} \langle \nabla f_{u}(x)-\nabla f_{u}(y),x-y\rangle = \left\langle \nabla \left( \int \left( f(x+uz)-f(y+uz)\right) dZ\right) ,x-y\right\rangle \end{array} \end{aligned}$$

After a suitable rewriting, we use (2) to lower bound the considered term. For any coordinate direction \(e_x\in {\mathbb {R}}^n\) (a vector with one component equal to one and all others equal to zero), we have,

$$\begin{aligned}&\left\langle \left( \lim _{t\rightarrow 0}\frac{\int \left( f(x+uz+te_x)-f(x+uz)-f(y+uz+te_x)+f(y+uz)\right) }{t} dZ\right) ,x-y\right\rangle \\&\ge -\rho \Vert x-y\Vert ^2+\\ &+\left\langle \left( \lim _{t\rightarrow 0}\frac{\int \left( f(x+uz+te_x)-f(x+te_x)-f(x+uz)+f(x)-f(y+uz+te_x)+f(y+te_x)+f(y+uz)-f(y)\right) }{t} dZ\right) ,x-y\right\rangle \\&\ge -\rho \Vert x-y\Vert ^2-4L_0u\Vert x-y\Vert , \end{aligned}$$

where the last inequality is obtained from the Lipschitz property of f (Assumption 1).

We make the following assumption on the stochastic oracle \(F(\cdot ,\xi )\):

Assumption 2

It holds that \(F(\cdot ,\xi )\) is \(L(\xi )\)-Lipschitz and \(L(P):=\sqrt{\mathbb {E}[L(\xi )^2]}\) is finite.

The following lemma uses the previous results to establish an important bound on the second moment of the estimate.

Lemma 6

Given a point x such that \(\Vert x\Vert \le M\), with M a finite positive constant, it holds that

$$\begin{aligned} \mathbb {E}[\Vert g_t(x)\Vert ^2] \le \hat{C}. \end{aligned}$$
(11)

where \(\hat{C}\) depends on M, L(P) and n but is independent of x.

Proof

Define \(\hat{f}(x)=f(x)+\rho \Vert x\Vert ^2\) for \(\Vert x\Vert \le M\), and extend it outside this ball by a continuous, linearly growing function (e.g., for any x with \(\Vert x\Vert >M\), take a largest norm subgradient g(x) of \(\hat{f}\) at \(\frac{M x}{\Vert x\Vert }\) and linearize, \(\hat{f}(x) = \hat{f}\left( \frac{M x}{\Vert x\Vert }\right) +g(x)^T (x-\frac{M x}{\Vert x\Vert })\)). Note that by this construction and the assumptions on f(x), it holds that \(\hat{f}(x)\) is convex and Lipschitz. Let \(\hat{g}_t(x)\) be the two point gradient approximation of \(\hat{f}(x)\), with \(\hat{f}_{u_{1,t}}(x)\) defined accordingly. Furthermore, let \(h(x) = \hat{f}(x)-f(x)\), let \(\hat{g}^h_t(x)\) be its two point gradient approximation, and let \(h_{u_{1,t}}(x)\) be its smoothed version. We have,

$$\begin{aligned} \Vert g_t(x)\Vert =\Vert \hat{g}_t(x)-\hat{g}_t^h(x)\Vert \le \Vert \hat{g}_t(x)\Vert +\Vert \hat{g}_t^h(x)\Vert . \end{aligned}$$

Since \(\hat{f}_{u_{1,t}}\) and \(h_{u_{1,t}}\) are both Lipschitz and convex, we can directly apply [9, Lemma 2] to both terms on the right hand side to obtain the final result.

Note that the last lemma combined with the previous results also implies a bound on \(\Vert \nabla f_{u_{1,t}}(x)\Vert ^2\); specifically,

$$\begin{aligned} \Vert \nabla f_{u_{1,t}}(x)\Vert ^2&\le 3\Vert \nabla f_{u_{1,t}}(x)-\nabla f_{u_{1,t},u_{2,t}}(x)\Vert ^2+3\mathbb {E}\Vert g_t(x)-\nabla f_{u_{1,t},u_{2,t}}(x)\Vert ^2 \\ \nonumber&\quad +3\mathbb {E}\Vert g_t(x)\Vert ^2 \le 3u^2_{2,t}\bar{\sigma }^2/u^2_{1,t}-6\mathbb {E}\left\langle g_t(x),\nabla f_{u_{1,t},u_{2,t}}(x)\right\rangle \\ \nonumber&\quad +3 \Vert \nabla f_{u_{1,t},u_{2,t}}(x)\Vert ^2+6\mathbb {E}\Vert g_t(x)\Vert ^2 \le 3u^2_{2,t}\bar{\sigma }^2/u^2_{1,t}+6\hat{C} \le \bar{C}. \end{aligned}$$
(12)

To get the first inequality, we used basic properties of the expectation and the inequality \((a+b+c)^2\le 3a^2+3b^2+3c^2\). Then we used Lemma 4 to upper bound the first term of the sum and suitably expanded the second one, thus getting the right hand side of the second inequality. The third inequality was finally obtained by taking into account the unbiasedness of \(g_t(x)\) (i.e., \(\mathbb {E}[g_t(x)]=\nabla f_{u_{1,t},u_{2,t}}(x)\)) and Lemma 6.

The algorithmic scheme used in the paper is reported in Algorithm 1. At each iteration t we simply build a two point estimate \(g_t\) of the gradient of the smoothed function and then apply the proximal map of \(\alpha _t r\) to the point \(x_t-\alpha _t g_t\), with \(\alpha _t>0\) a suitably chosen stepsize.

We let \(\alpha _t\) be a diminishing step-size and set

$$\begin{aligned} u_{1,t}= \alpha _t^2\quad \text{ and }\quad u_{2,t} = \alpha _t^3. \end{aligned}$$
(13)
[Algorithm 1]

Our scheme is thus a derivative free version of Algorithm 3.1 reported in [2].
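
To make the scheme fully explicit, we give a minimal Python sketch below; the names, the stepsize callable alpha, and the interfaces of the user supplied noisy oracle F(x, xi) and of the proximal map prox_r(v, a) of \(a\,r\) are our own illustrative choices, not part of the original pseudocode.

```python
import numpy as np

def zo_stochastic_prox(F, prox_r, x0, alpha, T, seed=0):
    """Sketch of Algorithm 1: derivative free stochastic proximal scheme."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for t in range(T):
        a = alpha(t)                    # stepsize alpha_t
        u1, u2 = a ** 2, a ** 3         # smoothing parameters, cf. (13)
        xi = rng.integers(2 ** 31)      # realization of the stochastic oracle (reused: CRN)
        Z1 = rng.normal(size=x.size)
        Z2 = rng.normal(size=x.size)
        # Two point estimate (5), with the same xi and Z1 in both evaluations.
        g = (F(x + u1 * Z1 + u2 * Z2, xi) - F(x + u1 * Z1, xi)) / u2 * Z2
        x = prox_r(x - a * g, a)        # x_{t+1} = prox_{alpha_t r}(x_t - alpha_t g_t)
    return x
```

For instance, with \(r=\Vert \cdot \Vert _1\) one could pass the soft thresholding operator sketched in the introduction as prox_r.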

3 Convergence of the derivative free algorithm

We now analyze the convergence properties of Algorithm 1. We follow [2, Sect. 3.2] in the proof of our results. We consider a value \(\bar{\rho }> \rho \), and assume \(\alpha _t<\min \left\{ \frac{1}{\bar{\rho }},\frac{\bar{\rho }-\rho }{2}\right\} \) for all t.

We first define the function

$$\begin{aligned} \phi ^{u,t}(x) = f_{u_{1,t}}(x)+r(x), \end{aligned}$$

and introduce the Moreau envelope function

$$\begin{aligned} \phi ^{u,t}_{1/\lambda }(x) =\min _y \phi ^{u,t}(y)+\frac{\lambda }{2}\Vert y-x\Vert ^2\ , \end{aligned}$$

with the proximal map

$$\begin{aligned} \text {prox}_{\phi ^{u,t}/\lambda } (x)= \displaystyle {{\,\mathrm{argmin}\,}}_y \{\phi ^{u,t}(y)+\frac{\lambda }{2}\Vert y-x\Vert ^2\}. \end{aligned}$$

We use the corresponding definition of \(\phi _{1/\lambda }(x)\) as well in the convergence theory,

$$\begin{aligned} \phi _{1/\lambda }(x) =\min _y \phi (y)+\frac{\lambda }{2}\Vert y-x\Vert ^2 = \min _y f(y)+r(y)+\frac{\lambda }{2}\Vert y-x\Vert ^2. \end{aligned}$$

To begin with let

$$\begin{aligned} \hat{x}_t = \text {prox}_{\phi ^{u,t}/\bar{\rho }} (x_t). \end{aligned}$$

Some of the steps follow along the same lines given in [2, Lemma 3.5], owing to the smoothness of \(f_{u_{1,t}}(x)\).

We derive the following recursion lemma, which establishes an important descent property for the iterates. We denote by \(\mathbb {E}_t\) the conditional expectation with respect to the \(\sigma \)-algebra of random events up to iteration t, i.e., all of \(Z_{1,s}\), \(Z_{2,s}\) and \(\xi _{s}\) are given for \(s<t\), and for \(s\ge t\) are random variables. In order to derive this lemma, we require an additional assumption that is reasonable in this setting.

Assumption 3

The sequence \(\{x_t\}\) generated by the algorithm is bounded (i.e., there exists an \(M>0\) s.t., \(\Vert x_t\Vert \le M\) for all t).

Note that this assumption can be satisfied if, for instance, \(r(\cdot )=\sum \limits _{j=1}^J r_j(\cdot )\) and, for at least one \(j\in \{1,\dots ,J\}\), \(r_j(\cdot )\) is the indicator of a compact set \(\mathcal {X}\) (i.e., \(r_j(x)=0\) if \(x\in \mathcal {X}\) and \(r_j(x)=\infty \) otherwise).
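
For illustration (our example), if one such \(r_j\) is the indicator of the box \([l,u]^n\), its proximal map is simply the Euclidean projection onto the box, i.e., componentwise clipping, regardless of the stepsize; a minimal sketch matching the prox_r(v, a) interface of the algorithm sketch above:

```python
import numpy as np

def prox_box_indicator(v, alpha, lo=-1.0, hi=1.0):
    """Prox of the indicator of [lo, hi]^n: projection onto the box.

    The parameter alpha is irrelevant for indicator functions; it is kept only to
    match the prox_r(v, a) interface of the algorithm sketch."""
    return np.clip(v, lo, hi)
```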

Lemma 7

Let \(\alpha _t\) satisfy,

$$\begin{aligned} \alpha _t \le \frac{\bar{\rho }-\rho }{(1+\bar{\rho }^2-2\bar{\rho }\rho +4\delta _0 L_0)}\ . \end{aligned}$$
(14)

where \(\delta _0=1-\alpha _0 \bar{\rho }\).

Then it holds that there exists a B independent of t such that

$$\begin{aligned} \mathbb {E}_t\Vert x_{t+1}-\hat{x}_t\Vert ^2 \le \Vert x_t-\hat{x}_t\Vert ^2+\alpha ^2_t B-\alpha _t(\bar{\rho }-\rho )\Vert x_t-\hat{x}_t\Vert ^2. \end{aligned}$$

Proof

First we see that \(\hat{x}_t\) can be obtained as a proximal point of r:

$$\begin{aligned} \begin{array}{l} \hat{x}_t = \text {prox}_{\phi ^{u,t}/\bar{\rho }} (x_t) \Longleftrightarrow \\ \\ \qquad \bar{\rho } (x_t-\hat{x}_t)\in \partial r(\hat{x}_t)+\nabla f_{u_{1,t}}(\hat{x}_t) \Longleftrightarrow \\ \\ \qquad \alpha _t \bar{\rho } (x_t-\hat{x}_t)\in \alpha _t \partial r(\hat{x}_t)+\alpha _t \nabla f_{u_{1,t}}(\hat{x}_t) \Longleftrightarrow \\ \\ \qquad \alpha _t \bar{\rho } x_t -\alpha _t \nabla f_{u_{1,t}}(\hat{x}_t) +(1-\alpha _t \bar{\rho } )\hat{x}_t \in \hat{x}_t+\alpha _t \partial r(\hat{x}_t) \\ \\ \qquad \Longleftrightarrow \hat{x}_t =\text {prox}_{\alpha _t r}\left( \alpha _t \bar{\rho } x_t-\alpha _t \nabla f_{u_{1,t}}(\hat{x}_t)+(1-\alpha _t \bar{\rho })\hat{x}_t\right) . \end{array} \end{aligned}$$

We notice that the last equivalence follows from the optimality conditions related to the proximal subproblem. Letting \(\delta _t=1-\alpha _t \bar{\rho }\), we get,

$$\begin{aligned}&\mathbb {E}_t\Vert x_{t+1}-\hat{x}_t\Vert ^2=\mathbb {E}_t\Vert \text {prox}_{\alpha _t r}(x_t-\alpha _t g_t)-\text {prox}_{\alpha _t r} (\alpha _t \bar{\rho }x_t -\alpha _t \nabla f_{u_{1,t}}(\hat{x}_t)+\delta _t \hat{x}_t)\Vert ^2 \\&\le \mathbb {E}_t\left\| x_t-\alpha _t g_t-(\alpha _t \bar{\rho }x_t-\alpha _t \nabla f_{u_{1,t}}(\hat{x}_t)+\delta _t \hat{x}_t)\right\| ^2, \end{aligned}$$

where the inequality is obtained by considering the non-expansiveness property of the proximal map \(\text {prox}_{\alpha _t r}(x)\). We thus can write the following chain of equalities:

$$\begin{aligned}&\mathbb {E}_t\left\| x_t-\alpha _t g_t-(\alpha _t \bar{\rho }x_t-\alpha _t \nabla f_{u_{1,t}}(\hat{x}_t)+\delta _t \hat{x}_t)\right\| ^2 \\&= \mathbb {E}_t \left\| \delta _t(x_t-\hat{x}_t)-\alpha _t (g_t-\nabla f_{u_{1,t}}(\hat{x}_t) ) \right\| ^2 \\&=\mathbb {E}_t \left\| \delta _t(x_t-\hat{x}_t)-\alpha _t (\nabla f_{u_{1,t}}(x_t)-\nabla f_{u_{1,t}}(\hat{x}_t) )-\alpha _t(g_t-\nabla f_{u_{1,t}}(x_t)) \right\| ^2 \\&=\mathbb {E}_t \left\| \delta _t(x_t-\hat{x}_t)-\alpha _t (\nabla f_{u_{1,t}}(x_t)-\nabla f_{u_{1,t}}(\hat{x}_t) )\right\| ^2 \\&-2\alpha _t\mathbb {E}_t \left[ \left\langle \delta _t(x_t-\hat{x}_t)-\alpha _t (\nabla f_{u_{1,t}}(x_t)-\nabla f_{u_{1,t}}(\hat{x}_t)) ,g_t-\nabla f_{u_{1,t}}(x_t)\right\rangle \right] \\&+\alpha _t^2\mathbb {E}_t\left\| g_t-\nabla f_{u_{1,t}}(x_t)\right\| ^2, \\ \end{aligned}$$

with the first equality obtained by rearranging the terms inside the norm, the second one by simply adding and subtracting \(\alpha _t \nabla f_{u_{1,t}}(x_t)\) to those terms, and the third one by taking into account the definition of Euclidean norm and the basic properties of the expectation. Now, we get the following

$$\begin{aligned}&\mathbb {E}_t \left\| \delta _t(x_t-\hat{x}_t)-\alpha _t (\nabla f_{u_{1,t}}(x_t)-\nabla f_{u_{1,t}}(\hat{x}_t) )\right\| ^2 \\&-2\alpha _t\mathbb {E}_t \left[ \left\langle \delta _t(x_t-\hat{x}_t)-\alpha _t (\nabla f_{u_{1,t}}(x_t)-\nabla f_{u_{1,t}}(\hat{x}_t)) ,g_t-\nabla f_{u_{1,t}}(x_t)\right\rangle \right] \\&+\alpha _t^2\mathbb {E}_t\left\| g_t-\nabla f_{u_{1,t}}(x_t)\right\| ^2\\&= \left\| \delta _t(x_t-\hat{x}_t)-\alpha _t (\nabla f_{u_{1,t}}(x_t)-\nabla f_{u_{1,t}}(\hat{x}_t) )\right\| ^2 \\&-2\alpha _t \left[ \left\langle \delta _t(x_t-\hat{x}_t)-\alpha _t (\nabla f_{u_{1,t}}(x_t)-\nabla f_{u_{1,t}}(\hat{x}_t)) ,\mathbb {E}[g_t]-\nabla f_{u_{1,t}}(x_t))\right\rangle \right] \\&+\alpha _t^2\mathbb {E}_t\left\| g_t-\nabla f_{u_{1,t}}(x_t)\right\| ^2 \\&= \left\| \delta _t(x_t-\hat{x}_t)-\alpha _t (\nabla f_{u_{1,t}}(x_t)-\nabla f_{u_{1,t}}(\hat{x}_t) )\right\| ^2 \\&-2\alpha _t \left[ \left\langle \delta _t(x_t-\hat{x}_t)-\alpha _t (\nabla f_{u_{1,t}}(x_t)-\nabla f_{u_{1,t}}(\hat{x}_t)) ,\nabla f_{u_{1,t}u_{2,t}}(x_t)-\nabla f_{u_{1,t}}(x_t))\right\rangle \right] \\&+\alpha _t^2\mathbb {E}_t\left\| g_t-\nabla f_{u_{1,t}}(x_t)\right\| ^2. \end{aligned}$$

The first equality holds because the first squared norm is deterministic conditioned on the past and the conditional expectation can be taken inside the inner product; we then used the unbiasedness of \(g_t\) (i.e., \(\mathbb {E}[g_t]=\nabla f_{u_{1,t},u_{2,t}}(x_t)\)) to get the second one. We now upper bound the terms in the sum:

$$\begin{aligned}&\left\| \delta _t(x_t-\hat{x}_t)-\alpha _t (\nabla f_{u_{1,t}}(x_t)-\nabla f_{u_{1,t}}(\hat{x}_t) )\right\| ^2\\&-2\alpha _t\left[ \left\langle \delta _t(x_t-\hat{x}_t)-\alpha _t (\nabla f_{u_{1,t}}(x_t)-\nabla f_{u_{1,t}}(\hat{x}_t)) ,\nabla f_{u_{1,t}u_{2,t}}(x_t)-\nabla f_{u_{1,t}}(x_t))\right\rangle \right] \\&\quad +\alpha _t^2\mathbb {E}_t\left\| g_t-\nabla f_{u_{1,t}}(x_t)\right\| ^2\\&\quad \le \left\| \delta _t(x_t-\hat{x}_t)-\alpha _t (\nabla f_{u_{1,t}}(x_t)-\nabla f_{u_{1,t}}(\hat{x}_t) )\right\| ^2 \\\quad &-2\alpha _t \left[ \left\langle \delta _t(x_t-\hat{x}_t)-\alpha _t (\nabla f_{u_{1,t}}(x_t)-\nabla f_{u_{1,t}}(\hat{x}_t)) ,\nabla f_{u_{1,t}u_{2,t}}(x_t)-\nabla f_{u_{1,t}}(x_t))\right\rangle \right] \\&\quad +2\alpha _t^2\mathbb {E}_t\left\| g_t-\nabla f_{u_{1,t},u_{2,t}}(x_t)\right\| ^2+2\alpha _t^2\mathbb {E}_t\left\| \nabla f_{u_{1,t},u_{2,t}}(x_t)-\nabla f_{u_{1,t}}(x_t)\right\| ^2\\&\quad \le \left\| \delta _t(x_t-\hat{x}_t)-\alpha _t (\nabla f_{u_{1,t}}(x_t)-\nabla f_{u_{1,t}}(\hat{x}_t) )\right\| ^2 \\ &\quad +2\left( \alpha _t \left\| \delta _t(x_t-\hat{x}_t)-\alpha _t (\nabla f_{u_{1,t}}(x_t)-\nabla f_{u_{1,t}}(\hat{x}_t))\right\| \right) \left\| \nabla f_{u_{1,t}u_{2,t}}(x_t)-\nabla f_{u_{1,t}}(x_t)\right\| \\& \quad +2\alpha _t^2\mathbb {E}_t\left\| g_t\right\| ^2-4\alpha _t^2\mathbb {E}_t\left\langle g_t(x_t),\nabla f_{u_{1,t},u_{2,t}}(x_t)\right\rangle +2\alpha _t^2\left\| \nabla f_{u_{1,t},u_{2,t}}(x_t)\right\| ^2\\&+\quad 2\alpha _t^2\mathbb {E}_t\left\| \nabla f_{u_{1,t},u_{2,t}}(x_t)-\nabla f_{u_{1,t}}(x_t)\right\| ^2 \\&\quad \le \left\| \delta _t(x_t-\hat{x}_t)-\alpha _t (\nabla f_{u_{1,t}}(x_t)-\nabla f_{u_{1,t}}(\hat{x}_t) )\right\| ^2\\&\quad +\alpha ^2_t\left\| \delta _t(x_t-\hat{x}_t)-\alpha _t (\nabla f_{u_{1,t}}(x_t)-\nabla f_{u_{1,t}}(\hat{x}_t)) \right\| ^2 +\alpha _t^2 \bar{\sigma }^2 \\&\quad +2\alpha _t^2\hat{C}-2\alpha _t^2\Vert \nabla f_{u_{1,t},u_{2,t}}(x)\Vert ^2+2\alpha _t^4\bar{\sigma }^2. \\&\quad \le \left\| \delta _t(x_t-\hat{x}_t)-\alpha _t (\nabla f_{u_{1,t}}(x_t)-\nabla f_{u_{1,t}}(\hat{x}_t) )\right\| ^2\\&\quad +\alpha ^2_t\left\| \delta _t(x_t-\hat{x}_t)-\alpha _t (\nabla f_{u_{1,t}}(x_t)-\nabla f_{u_{1,t}}(\hat{x}_t)) \right\| ^2 +\alpha _t^2(1+2\alpha _t^2) \bar{\sigma }^2+2\alpha _t^2\hat{C} \end{aligned}$$

We first split the last term of the previous display using \((a+b)^2\le 2a^2+2b^2\) and basic properties of the expectation, which gives the first inequality. The second inequality was obtained by using Cauchy-Schwarz on the inner product and by suitably expanding the third term of the sum. For the third inequality we used \(2a\cdot b\le a^2+b^2\) combined with Lemma 4 (equation (10)) to bound the term \(\left\| \nabla f_{u_{1,t},u_{2,t}}(x_t)-\nabla f_{u_{1,t}}(x_t)\right\| ^2\), plugging in equation (13) to obtain the explicit constant and its dependence on \(\alpha _t\), Lemma 6 to upper bound \(\mathbb {E}_t\left\| g_t\right\| ^2\), and the unbiasedness of \(g_t\). The last inequality follows by dropping the nonpositive term \(-2\alpha _t^2\Vert \nabla f_{u_{1,t},u_{2,t}}(x_t)\Vert ^2\) and collecting the \(\bar{\sigma }^2\) terms. Hence we write

$$\begin{aligned}&\left\| \delta _t(x_t-\hat{x}_t)-\alpha _t (\nabla f_{u_{1,t}}(x_t)-\nabla f_{u_{1,t}}(\hat{x}_t) )\right\| ^2+\\&\quad +\alpha ^2_t \left\| \delta _t(x_t-\hat{x}_t)-\alpha _t (\nabla f_{u_{1,t}}(x_t)-\nabla f_{u_{1,t}}(\hat{x}_t)) \right\| ^2 +\alpha _t^2(1+2\alpha _t^2) \bar{\sigma }^2+2\alpha _t^2\hat{C}\\&\quad = (1+\alpha _t^2)\delta _t^2 \Vert x_t-\hat{x}_t\Vert ^2-2(1+\alpha _t^2)\delta _t \alpha _t\langle x_t-\hat{x}_t,\nabla f_{u_{1,t}}(x_t)-\nabla f_{u_{1,t}} (\hat{x}_t)\rangle \\&\quad +(1+\alpha _t^2)\alpha _t^2 \Vert \nabla f_{u_{1,t}}(x_t)-\nabla f_{u_{1,t}} (\hat{x}_t)\Vert ^2 +\alpha _t^2(1+2\alpha _t^2) \bar{\sigma }^2+2\alpha _t^2\hat{C}\\&\quad \le (1+\alpha _t^2)\delta _t^2 \Vert x_t-\hat{x}_t\Vert ^2+2(1+\alpha _t^2)\delta _t\alpha _t \rho \Vert x_t-\hat{x}_t\Vert ^2+8(1+\alpha _t^2)\delta _t L_0 \alpha _t^3 \Vert x_t-\hat{x}_t\Vert \\&\quad +(1+\alpha _t^2)\alpha _t^2 \left( \Vert \nabla f_{u_{1,t}}(x_t)\Vert ^2-2\langle \nabla f_{u_{1,t}}(x_t),\nabla f_{u_{1,t}} (\hat{x}_t)\rangle +\Vert \nabla f_{u_{1,t}} (\hat{x}_t)\Vert ^2 \right) \\&\quad +\alpha _t^2(1+2\alpha _t^2) \bar{\sigma }^2+2\alpha _t^2\hat{C} \\&\quad \le (1+\alpha _t^2)\delta _t^2 \Vert x_t-\hat{x}_t\Vert ^2+2(1+\alpha _t^2)\delta _t\alpha _t \rho \Vert x_t-\hat{x}_t\Vert ^2+8(1+\alpha _t^2)\delta _t L_0 \alpha _t^3 \Vert x_t-\hat{x}_t\Vert \\&\quad +(1+\alpha _t^2)\alpha _t^2 \left( \Vert \nabla f_{u_{1,t}}(x_t)\Vert ^2+2\Vert \nabla f_{u_{1,t}}(x_t)\Vert \Vert \nabla f_{u_{1,t}} (\hat{x}_t)\Vert +\Vert \nabla f_{u_{1,t}} (\hat{x}_t)\Vert ^2 \right) \\&\quad +\alpha _t^2(1+2\alpha _t^2) \bar{\sigma }^2+2\alpha _t^2\hat{C} \\&\quad \le (1+\alpha _t^2)\delta _t^2 \Vert x_t-\hat{x}_t\Vert ^2+2(1+\alpha _t^2)\delta _t\alpha _t \rho \Vert x_t-\hat{x}_t\Vert ^2+8(1+\alpha _t^2)\delta _t L_0 \alpha _t^3 \Vert x_t-\hat{x}_t\Vert \\&\quad +4(1+\alpha _t^2)\alpha _t^2 \bar{C} +\alpha _t^2(1+2\alpha _t^2) \bar{\sigma }^2+2\alpha _t^2\hat{C}, \\ \end{aligned}$$

where the equality is given by rearranging the terms in the sum and expanding the squared Euclidean norm. The inequalities are obtained by upper bounding the inner product by means of Lemma 5 (recalling that \(u_{1,t}=\alpha _t^2\)) and the third term of the sum by combining the Cauchy-Schwarz inequality with (12). Continuing:

$$\begin{aligned}&(1+\alpha _t^2)\delta _t^2 \Vert x_t-\hat{x}_t\Vert ^2+2(1+\alpha _t^2)\delta _t\alpha _t \rho \Vert x_t-\hat{x}_t\Vert ^2+8(1+\alpha _t^2)\delta _t L_0 \alpha _t^3 \Vert x_t-\hat{x}_t\Vert \\&\quad +4(1+\alpha _t^2)\alpha _t^2 \bar{C} +\alpha _t^2(1+2\alpha _t^2) \bar{\sigma }^2+2\alpha _t^2\hat{C} \\&\quad = (1+\alpha _t^2)\delta _t^2 \Vert x_t-\hat{x}_t\Vert ^2+2(1+\alpha _t^2)\delta _t\alpha _t \rho \Vert x_t-\hat{x}_t\Vert ^2\\&\quad +8(1+\alpha _t^2)\delta _t L_0 \left( \alpha _t^2\right) \left( \alpha _t \Vert x_t-\hat{x}_t\Vert \right) \\&\quad +4(1+\alpha _t^2)\alpha _t^2 \bar{C} +\alpha _t^2(1+2\alpha _t^2) \bar{\sigma }^2+2\alpha _t^2\hat{C} \\&\quad \le (1+\alpha _t^2)\delta _t^2 \Vert x_t-\hat{x}_t\Vert ^2+2(1+\alpha _t^2)\delta _t\alpha _t \rho \Vert x_t-\hat{x}_t\Vert ^2+4(1+\alpha _t^2)\delta _t L_0 \alpha _t^4 \\&\quad +4(1+\alpha _t^2)\delta _t L_0\alpha ^2_t \Vert x_t-\hat{x}_t\Vert ^2+4(1+\alpha _t^2)\alpha _t^2 \bar{C} +\alpha _t^2(1+2\alpha _t^2) \bar{\sigma }^2+2\alpha _t^2\hat{C} \\&\quad = (1+\alpha _t^2)\delta _t^2 \Vert x_t-\hat{x}_t\Vert ^2+2(1+\alpha _t^2)\delta _t\alpha _t \rho \Vert x_t-\hat{x}_t\Vert ^2\\&\quad +4(1+\alpha _t^2)\delta _t L_0 \alpha _t^2 \Vert x_t-\hat{x}_t\Vert ^2 +4(1+\alpha _t^2)\delta _t L_0 \alpha _t^4+4(1+\alpha _t^2)\alpha _t^2 \bar{C} \\&\quad +\alpha _t^2(1+2\alpha _t^2) \bar{\sigma }^2+2\alpha _t^2\hat{C}. \\ \end{aligned}$$

The first and last equalities are simply obtained by rearranging the terms in the sum. The inequality is obtained by upper bounding the third term using \(2a\cdot b\le a^2+b^2\). Finally, recalling the definition \(\delta _t=1-\alpha _t \bar{\rho }\), we have

$$\begin{aligned}&(1+\alpha _t^2)\delta _t^2 \Vert x_t-\hat{x}_t\Vert ^2+2(1+\alpha _t^2)\delta _t\alpha _t \rho \Vert x_t-\hat{x}_t\Vert ^2+4(1+\alpha _t^2)\delta _t L_0 \alpha _t^2 \Vert x_t-\hat{x}_t\Vert ^2\\&\quad +4(1+\alpha _t^2)\delta _t L_0 \alpha _t^4+4(1+\alpha _t^2)\alpha _t^2 \bar{C} +\alpha _t^2(1+2\alpha _t^2) \bar{\sigma }^2+2\alpha _t^2\hat{C} \\&\quad = \left[ 1-2\alpha _t\bar{\rho }+\alpha ^2_t\bar{\rho }^2+\alpha _t^2-2\alpha ^3_t\bar{\rho }+\alpha ^4_t\bar{\rho }^2 +2\alpha _t \rho -2\alpha ^2_t\bar{\rho }\rho +2\alpha ^3_t \rho -2\alpha ^4_t\bar{\rho }\rho \right. \\&\quad \left. +4\delta _t L_0 \alpha _t^2+4\delta _t L_0 \alpha _t^4\right] \Vert x_t-\hat{x}_t\Vert ^2\\&\quad +\left[ 4(1+\alpha _t^2)\delta _t L_0 \alpha _t^2+4(1+\alpha _t^2)\bar{C} +(1+2\alpha _t^2)\bar{\sigma }^2+2\hat{C}\right] \alpha _t^2 \\&\quad = \left[ 1-2\alpha _t(\bar{\rho }-\rho )+\alpha ^2_t(1+\bar{\rho }^2-2\bar{\rho }\rho +4\delta _t L_0)-2\alpha ^3_t(\bar{\rho }-\rho )\right] \Vert x_t-\hat{x}_t\Vert ^2\\&\quad +\alpha ^4_t(\bar{\rho }^2 -2\bar{\rho }\rho +4\delta _t L_0 )\Vert x_t-\hat{x}_t\Vert ^2\\&\quad +\left[ 4(1+\alpha _t^2)\delta _t L_0 \alpha _t^2+4(1+\alpha _t^2)\bar{C} +(1+2\alpha _t^2)\bar{\sigma }^2 + 2 \hat{C}\right] \alpha _t^2 \\&\quad \le \left[ 1-2\alpha _t(\bar{\rho }-\rho )+\alpha _t(\bar{\rho }-\rho )-2\alpha ^3_t(\bar{\rho }-\rho )\right] \Vert x_t-\hat{x}_t\Vert ^2\\&\quad +\alpha ^3_t(\bar{\rho }-\rho )\Vert x_t-\hat{x}_t\Vert ^2\\&\quad +\left[ 4(1+\alpha _t^2)\delta _t L_0 \alpha _t^2+4(1+\alpha _t^2)\bar{C} +(1+2\alpha _t^2)\bar{\sigma }^2 + 2 \hat{C}\right] \alpha _t^2 \\&\quad \le \left[ 1-\alpha _t(\bar{\rho }-\rho )\right] \Vert x_t-\hat{x}_t\Vert ^2 \\&\quad +\left[ 4(1+\alpha _t^2)\delta _t L_0 \alpha _t^2+4(1+\alpha _t^2)\bar{C} +(1+2\alpha _t^2)\bar{\sigma }^2 + 2 \hat{C}\right] \alpha _t^2 \\&\quad := \left[ 1-\alpha _t(\bar{\rho }-\rho )\right] \Vert x_t-\hat{x}_t\Vert ^2+B\alpha _t^2, \end{aligned}$$

where the second to last inequality is obtained by using the bound on \(\alpha _t\) in Eq. (14). Combining the whole chain of inequalities above yields the result.

After proving Lemma 7, we can now state the main convergence result for Algorithm 1.

Theorem 1

The sequence generated by Algorithm 1 satisfies,

$$\begin{aligned} \begin{array}{l} \mathbb {E}[\phi _{1/\bar{\rho }}(x_{t+1})] \le \mathbb {E}[\phi _{1/{\bar{\rho }}}(x_t)]+(2L_0 \sqrt{n}+\frac{B\bar{\rho }}{2})\alpha _t^2-\frac{\alpha _t(\bar{\rho }-\rho )}{2\bar{\rho }}\mathbb {E}[\Vert \nabla \phi ^{u,t}_{1/{\bar{\rho }}}(x_t)\Vert ^2]\\ \end{array} \end{aligned}$$

and thus, for an index \(t^*\) sampled from \(\{0,\dots ,T\}\) with probability \(\mathbb {P}(t^*=t)=\alpha _t/\sum _{s=0}^T \alpha _s\),

$$\begin{aligned} \mathbb {E}[\Vert \nabla \phi ^{u,t^*}_{1/{\bar{\rho }}}(x_{t^*})\Vert ^2]&= \frac{1}{\sum _{t=0}^T \alpha _t} \sum \limits _{t=0}^T \alpha _t \mathbb {E}[\Vert \nabla \phi ^{u,t}_{1/{\bar{\rho }}}(x_t)\Vert ^2]\\ \nonumber&\le \frac{2\bar{\rho }}{\bar{\rho }-\rho } \frac{\phi _{1/\bar{\rho }}(x_0)-\phi _\star +(2L_0 \sqrt{n}+\frac{B\bar{\rho }}{2})\sum \limits _{t=0}^T\alpha _t^2}{\sum \limits _{t=0}^T \alpha _t}. \end{aligned}$$
(15)

In particular, if we define \(\alpha _t\) to be

$$\begin{aligned} \alpha _t = \min \left\{ \frac{1}{\bar{\rho }},\frac{\bar{\rho }-\rho }{2},\sqrt{\frac{ \phi _{1/{\bar{\rho }}}(x_0)-\phi _*}{2L_0 \sqrt{n}+\frac{B\bar{\rho }}{2}}}\right\} \frac{1}{\sqrt{T+1}} \end{aligned}$$
(16)

then,

$$\begin{aligned} \mathbb {E}[\Vert \nabla \phi ^{u,t^*}_{1/{\bar{\rho }}}(x_{t^*})\Vert ^2] \le \frac{4\bar{\rho }}{\bar{\rho }-\rho }\sqrt{\frac{(\phi _{1/{\bar{\rho }}}(x_0)-\phi _*) (2L_0 \sqrt{n}+\frac{B\bar{\rho }}{2})}{T+1}}. \end{aligned}$$
(17)

Proof

We have,

$$\begin{aligned}&\mathbb {E}_t[\phi _{1/\bar{\rho }}(x_{t+1})] \le \mathbb {E}_t\left[ \phi (\hat{x}_t)+\frac{\bar{\rho }}{2}\Vert \hat{x}_t-x_{t+1}\Vert ^2\right] \\&\le \phi (\hat{x}_t)+\frac{\bar{\rho }}{2}\left( \Vert x_t-\hat{x}_t\Vert ^2+B\alpha _t^2-\alpha _t(\bar{\rho }-\rho )\Vert x_t-\hat{x}_t\Vert ^2\right) \\&\le \phi ^{u,t}(\hat{x}_t)+u_{1,t} L_0 \sqrt{n} +\frac{\bar{\rho }}{2}\left( \Vert x_t-\hat{x}_t\Vert ^2+B\alpha _t^2-\alpha _t(\bar{\rho }-\rho )\Vert x_t-\hat{x}_t\Vert ^2\right) , \\ \end{aligned}$$

where the first inequality comes from the definition of the Moreau envelope (evaluating the minimum at the feasible point \(\hat{x}_t\)), the second from the result proved in Lemma 7, and the third from Lemma 2. Continuing,

$$\begin{aligned}&\phi ^{u,t}(\hat{x}_t) +u_{1,t} L_0 \sqrt{n}+\frac{\bar{\rho }}{2}\left( \Vert x_t-\hat{x}_t\Vert ^2+B\alpha _t^2-\alpha _t(\bar{\rho }-\rho )\Vert x_t-\hat{x}_t\Vert ^2\right) \\&= \phi ^{u,t}_{1/{\bar{\rho }}}(x_t)+u_{1,t} L_0 \sqrt{n}+\frac{B\bar{\rho }}{2}\alpha _t^2-\frac{\bar{\rho }\alpha _t}{2}(\bar{\rho }-\rho )\Vert x_t-\hat{x}_t\Vert ^2 \\&\le \phi _{1/{\bar{\rho }}}(x_t)+2u_{1,t} L_0 \sqrt{n}+\frac{B\bar{\rho }}{2}\alpha _t^2-\frac{\bar{\rho }\alpha _t}{2}(\bar{\rho }-\rho )\Vert x_t-\hat{x}_t\Vert ^2 \\&\le \phi _{1/{\bar{\rho }}}(x_t)+2 \alpha _t^2L_0 \sqrt{n}+\frac{B\bar{\rho }}{2}\alpha _t^2-\frac{\bar{\rho }\alpha _t}{2}(\bar{\rho }-\rho )\Vert x_t-\hat{x}_t\Vert ^2 \\&= \phi _{1/{\bar{\rho }}}(x_t)+2 \alpha _t^2L_0 \sqrt{n}+\frac{B\bar{\rho }}{2}\alpha _t^2-\frac{\alpha _t(\bar{\rho }-\rho )}{2\bar{\rho }}\Vert \nabla \phi ^{u,t}_{1/{\bar{\rho }}}(x_t)\Vert ^2, \end{aligned}$$

with the first inequality justified below using Lemma 2. Indeed, let us call \(\bar{x}_t\) the minimizer of \(\phi (x)+\frac{\bar{\rho }}{2}\Vert x-x_t\Vert ^2\) and recall that \(\hat{x}_t\) is the minimizer of \(\phi ^{u,t}(x)+\frac{\bar{\rho }}{2}\Vert x-x_t\Vert ^2\). We have

$$\begin{aligned} \phi ^{u,t}_{1/{\bar{\rho }}}(x_t)=\phi ^{u,t}(\hat{x}_t)+\frac{\bar{\rho }}{2}\Vert \hat{x}_t-x_t\Vert ^2\le & {} \phi ^{u,t}(\bar{x}_t)+\frac{\bar{\rho }}{2}\Vert \bar{x}_t-x_t\Vert ^2 \nonumber \\\le & {} \phi (\bar{x}_t)+u_{1,t}L_0\sqrt{n}+\frac{\bar{\rho }}{2}\Vert \bar{x}_t-x_t\Vert ^2 \nonumber \\= & {} \phi _{1/{\bar{\rho }}}(x_t)+u_{1,t}L_0\sqrt{n}. \end{aligned}$$

The second inequality is obtained by using the definition of \(u_{1,t}\) in (13). The last equality follows from basic properties of the Moreau envelope, namely \(\nabla \phi ^{u,t}_{1/{\bar{\rho }}}(x_t)=\bar{\rho }(x_t-\hat{x}_t)\), and from the definition of \(\hat{x}_t\) (see the beginning of the proof of Lemma 7). Now, we take full expectations and obtain:

$$\begin{aligned} \mathbb {E}[\phi _{1/\bar{\rho }}(x_{t+1})]&\le \mathbb {E}[\phi _{1/{\bar{\rho }}}(x_t)]+2 \alpha _t^2L_0 \sqrt{n}\\&\quad +\frac{B\bar{\rho }}{2}\alpha _t^2-\frac{\alpha _t(\bar{\rho }-\rho )}{2\bar{\rho }}\mathbb {E}[\Vert \nabla \phi ^{u,t}_{1/{\bar{\rho }}}(x_t)\Vert ^2]. \end{aligned}$$

The rest of the proof is as in [2, Theorem 3.4]. In particular, summing the recursion, we get,

$$\begin{aligned} \mathbb {E}[\phi _{1/\bar{\rho }}(x_{T+1})]&\le \mathbb {E}[\phi _{1/{\bar{\rho }}}(x_0)]+(2L_0 \sqrt{n}+\frac{B\bar{\rho }}{2})\sum \limits _{t=0}^T\alpha _t^2\\&\quad -\frac{(\bar{\rho }-\rho )}{2\bar{\rho }}\sum \limits _{t=0}^T \alpha _t\mathbb {E}[\Vert \nabla \phi ^{u,t}_{1/{\bar{\rho }}}(x_t)\Vert ^2]. \end{aligned}$$

Now, noting that

$$\begin{aligned} \phi _{1/\lambda }(x)=\min _y f(y)+r(y)+\frac{\lambda }{2}\Vert y-x\Vert ^2\ge \phi _\star , \end{aligned}$$

where we used the lower boundedness of \(f+r\) in Assumption 1, we can finally state that

$$\begin{aligned}&\frac{1}{\sum _{t=0}^T \alpha _t} \sum \limits _{t=0}^T \alpha _t \mathbb {E}[\Vert \nabla \phi ^{u,t}_{1/{\bar{\rho }}}(x_t)\Vert ^2] \le \\&\le \frac{2\bar{\rho }}{\bar{\rho }-\rho } \frac{\phi _{1/\bar{\rho }}(x_0)-\phi _\star +(2L_0 \sqrt{n}+\frac{B\bar{\rho }}{2})\sum \limits _{t=0}^T\alpha _t^2}{\sum \limits _{t=0}^T \alpha _t}. \end{aligned}$$

Since the left-hand side is by definition \(\mathbb {E}[\Vert \nabla \phi ^{u,t^*}_{1/{\bar{\rho }}}(x_{t^*})\Vert ^2]\), we get the inequality (15). Furthermore, by plugging the expression of \(\alpha _t\) given in (16) into (15), we get the final inequality (17).

Theorem 1 gives an overall bound, in terms of the number of iterations, on the weighted expected squared norm of the gradient of the Moreau envelope, which serves as the statistical measure of distance to convergence. The worst case bound is weighted by the range of function values the algorithm must traverse, i.e., from the starting value to the global minimum, as well as by the error accumulated in the iterates while traversing this range due to the inaccuracy of the noisy zeroth order gradient approximations. The order of convergence is the same as the one reported in [2]; however, the constant is larger, given the additional error in the quality of the steps. Note that, as the convergence result is stated in a similar formalism, using the gradient of the Moreau envelope, we can interpret this approximate stationarity concept as in [2, pp 3–4]: a small value of \(\Vert \nabla \phi _{\lambda }(x)\Vert \) implies that x is near some point \(\hat{x}\) (specifically \(\Vert x-\hat{x}\Vert =\lambda \Vert \nabla \phi _{\lambda }(x)\Vert \)) whose distance to stationarity is bounded as \({{\,\mathrm{dist}\,}}(0,\partial \phi (\hat{x}))\le \Vert \nabla \phi _{\lambda }(x)\Vert \). In our case an additional level of approximation to stationarity is present, since we take the gradient of the Moreau envelope of the smoothed objective, which is itself a perturbation of the original function.

4 Numerical results

In this section, we investigate the numerical performance of Algorithm 1 on a set of standard weakly convex optimization problems defined in [2]. In particular, we consider phase retrieval, which seeks to minimize the function,

$$\begin{aligned} \min _{x\in {\mathbb {R}}^d} \frac{1}{m} \sum \limits _{i=1}^m \left| \langle a_i,x\rangle ^2-b_i\right| \end{aligned}$$
(18)

and blind deconvolution, which seeks to minimize

$$\begin{aligned} \min _{(x,y)\in {\mathbb {R}}^d\times {\mathbb {R}}^d} \frac{1}{m} \sum \limits _{i=1}^m \left| \langle u_i,x\rangle \langle v_i,y\rangle -b_i\right| . \end{aligned}$$
(19)

Both of these applications are ones in which Common Random Numbers (CRNs) are a reasonable assumption, making two-point gradient estimates relevant. In particular, in (18), the pairs \((a_i,b_i)\) can be held constant between two function evaluations, and in (19), triplets \((u_i,v_i,b_i)\) can be fixed as well.
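
As an illustration of how such an oracle can be set up (our own sketch; the single measurement, i.e., unit-size sample, per evaluation is an assumption), the stochastic zeroth order oracle for (18) can return the loss on one randomly chosen measurement, with the index derived from \(\xi \) so that it can be held fixed across the two evaluations of the estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 10, 30
A = rng.normal(size=(m, d))                      # measurement vectors a_i
x_true = rng.normal(size=d)
x_true /= np.linalg.norm(x_true)                 # target signal on the unit sphere
b = (A @ x_true) ** 2                            # targets b_i = <a_i, x_true>^2

def F_phase(x, xi):
    """Noisy phase retrieval oracle: |<a_i, x>^2 - b_i| for one random index i.

    The index i is derived from xi, so passing the same xi to both evaluations of
    the two point estimator implements common random numbers."""
    i = int(xi) % m
    return abs((A[i] @ x) ** 2 - b[i])
```

Such an oracle can be passed directly to the algorithm sketch of Sect. 2, with prox_r the identity map when no regularizer is present.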

4.1 Comparison with methods using a stochastic subgradient oracle

We first compare Algorithm 1 with the stochastic subgradient method and the stochastic proximal method in [2]. The goal of this set of experiments is to understand whether our approach is competitive with methods that use a stochastic subgradient oracle, and how the practical behavior of the method matches the theoretical analysis.

We generate random Gaussian measurements drawn from \(N(0,I_{d\times d})\) and a target signal \(\bar{x}\) drawn uniformly from the unit sphere to compute the \(b_i\), for dimensions \((d,m)=(10,30),(20,60),(40,120)\). We use fixed stepsizes \(\alpha _t\) in the range \([10^{-6},10^{-1}]\). We generate ten runs of each algorithm for every dimension and stepsize, and pick the best one according to the final objective value. The total number of iterations used in all cases is 100,000.

We show the gap of the different methods when varying the stepsize for both phase retrieval (Fig. 1) and blind deconvolution (Fig. 2).

It is interesting that the zeroth order algorithm performs on par with the methods that use the stochastic subgradient oracle. In particular, our method is more robust to the choice of the stepsize than the stochastic subgradient method, and it is competitive with the proximal method. In Figs. 3 and 4, we report the path of the objective values obtained with stepsize \(10^{-4}\) for the instances with \((d,m)=(10,30)\). These are nice examples of how well the zeroth order algorithm works when the stepsize is properly chosen.

Fig. 1: Gap values for phase retrieval

Fig. 2: Gap values for blind deconvolution

Fig. 3: Convergence of the function values—phase retrieval (d,m)=(10,30)

Fig. 4: Convergence of the function values—blind deconvolution (d,m)=(10,30)

4.2 Comparison with a naive stochastic variant of NOMAD

Now, in order to understand if Algorithm 1 is competitive with other (stochastic) nonsmooth methods from the DFO literature, we report here a preliminary comparison with a naive stochastic variant of NOMAD [16, 17]. More specifically, we consider a mesh adaptive direct search (MADS) method that uses a unit-size sample for each evaluation of the zeroth order oracle. We use 100 randomly generated instances in our tests for both the phase retrieval and the blind deconvolution problems. We generate random Gaussian measurements drawn from \(N(0,I_{d\times d})\) and a target signal \(\bar{x}\) drawn uniformly from the unit sphere to compute the \(b_i\), with dimensions \((d,m)=(4,10)\). The choice of restricting the analysis to small dimensional instances is mainly due to the fact that this naive version of NOMAD performs very poorly on larger instances. We use fixed stepsizes \(\alpha _t\in \{10^{-3},10^{-2}\}\) in our algorithm. We generate ten runs of each algorithm for every problem and pick the best one according to the final objective value. The total number of function evaluations used in all cases is 10,000. We consider data and performance profiles [19] when comparing the methods. Specifically, let S be a set of algorithms and P a set of problems. For each \(s\in S\) and \(p \in P\), let \(t_{p,s}\) be the number of function evaluations required by algorithm s on problem p to satisfy the condition

$$\begin{aligned} f(x_k) \le f_L + \tau (f(x_0) - f_L) \end{aligned}$$
(20)

where \(0< \tau < 1\) and \(f_L\) is the best objective function value achieved by any solver on problem p. Then, performance and data profiles of solver s are the following functions

$$\begin{aligned} \rho _s(\alpha )= & {} \frac{1}{|P|}\left| \left\{ p\in P: \frac{t_{p,s}}{\min \{t_{p,s'}:s'\in S\}}\le \alpha \right\} \right| ,\\ d_s(\kappa )= & {} \frac{1}{|P|}\left| \left\{ p\in P: t_{p,s}\le \kappa (n_p+1)\right\} \right| \end{aligned}$$

where \(n_p\) is the dimension of problem p.
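
For completeness, a small Python sketch of the profile computation follows (our own; it assumes a \(|P|\times |S|\) array T of evaluation counts \(t_{p,s}\), with np.inf marking runs that never satisfy (20), and that every problem is solved by at least one solver).

```python
import numpy as np

def performance_and_data_profiles(T, dims, alphas, kappas):
    """T: |P| x |S| array of t_{p,s}; dims: array of problem dimensions n_p."""
    best = T.min(axis=1, keepdims=True)                      # min_{s'} t_{p,s'}
    perf = np.array([[np.mean(T[:, s] / best[:, 0] <= a) for a in alphas]
                     for s in range(T.shape[1])])            # rho_s(alpha)
    data = np.array([[np.mean(T[:, s] <= k * (dims + 1)) for k in kappas]
                     for s in range(T.shape[1])])            # d_s(kappa)
    return perf, data
```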

We report, in Figs. 5 and 6, the data and performance profiles for the experiments on phase retrieval and blind deconvolution problems, respectively. From the plots it can be seen that our algorithm (with suitable choices of the stepsize) outperforms the naive version of NOMAD at all precisions. We notice that, for \(\tau =10^{-3}\), NOMAD does not appear in the plots, meaning it never satisfies condition (20) at this precision. We further report, in Figs. 7 and 8, the box plots of the function gaps obtained by the algorithms over the 100 instances considered in the tests. These plots show that our algorithm gets very close to the optimal value for suitable choices of the stepsize.

Fig. 5: Comparison between DFO and NOMAD for phase retrieval—performance and data profiles

Fig. 6: Comparison between DFO and NOMAD for blind deconvolution—performance and data profiles

5 Conclusion

In this paper we studied, for the first time, minimization of a stochastic weakly convex function without access to an oracle providing a noisy estimate of a subgradient of the function, i.e., in the context of derivative-free or zeroth order optimization. We were able to derive theoretical convergence rate results on par with the standard methods for stochastic weakly convex optimization, and demonstrated the algorithm's efficacy on a couple of standard test cases. In expanding the scope of zeroth order optimization, we hope that this work highlights the potential of derivative free methods in general, and the two point smoothed function approximation technique in particular, for an increasingly wide class of problems.

Fig. 7: Box plots of the function gap—phase retrieval

Fig. 8: Box plots of the function gap—blind deconvolution