A zeroth order method for stochastic weakly convex optimization

In this paper, we consider stochastic weakly convex optimization problems without assuming the existence of a stochastic subgradient oracle. We present a derivative free algorithm that uses a two point approximation for computing a gradient estimate of the smoothed function. We prove convergence at a similar rate as state of the art methods, though with a larger constant, and report numerical results showing the effectiveness of the approach.


Introduction
In this paper, we study the following class of problems:

min_{x ∈ ℝⁿ} φ(x) ∶= f(x) + r(x),   (1)

where f(⋅) ∶ ℝⁿ → ℝ is a stochastic, weakly convex, and potentially nonsmooth (i.e., not necessarily continuously differentiable) function, and r(⋅) ∶ ℝⁿ → ℝ̄ is extended real valued and convex, but not necessarily even continuous; r(x) satisfies some additional conditions detailed below. Furthermore, we consider the derivative free or zeroth order context, wherein the subgradients of f, or unbiased estimates thereof, are not available, and only unbiased estimates of function evaluations f(x) are available. To be precise, we thus write

f(x) = 𝔼_{ξ∼P}[F(x, ξ)],

with {F(⋅, ξ), ξ ∈ Ξ} a collection of real valued functions and P a probability distribution over the set Ξ.
We define two quantitative assumptions regarding f(⋅) and r(⋅) below. First, we define the notion of a proximal map: for any convex function h, we write prox_h to indicate the following function:

prox_h(x) = argmin_y { h(y) + ½‖y − x‖² }.

The associated optimality condition and the nonexpansiveness property of the proximal mapping will be used in the sequel. We now state our standing assumption on the properties of (1).

We shall denote the lower bound of φ by φ⋆ = f⋆ + r⋆. We further assume that the proximal map of r(x) can be evaluated at low computational cost. We note that the ρ-weak convexity property for a given function f is equivalent to hypomonotonicity of its subdifferential map, that is, for v ∈ ∂f(x) and w ∈ ∂f(y),

⟨v − w, x − y⟩ ≥ −ρ‖x − y‖²   (2)

(see e.g., [1, Example 12.28, p. 549]).

The class of weakly convex functions is a special yet very common case of nonconvex functions, which contains all convex (possibly nonsmooth) functions and all Lipschitz smooth functions. One standard subset of weakly convex functions is given by the composite functions f(x) = h(c(x)), where h is nonsmooth and convex and c(x) is continuously differentiable but nonconvex (see e.g., [2] and references therein). The additive composite class is another widely used class of weakly convex functions [3], formed from all sums g(x) + l(x) with l closed and convex and g continuously differentiable.
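Since the analysis assumes that the proximal map of r can be evaluated at low cost, it may help to see two standard closed-form instances. The sketch below (in Python with NumPy; the helper names are ours, not the paper's) implements the prox of the ℓ1 norm and of the indicator of a box.

```python
import numpy as np

def prox_l1(x, lam):
    # prox of h(y) = lam * ||y||_1, i.e.
    # argmin_y { lam*||y||_1 + 0.5*||y - x||^2 }: componentwise soft-thresholding
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def prox_box(x, lo, hi):
    # prox of the indicator of the box [lo, hi]^n: Euclidean projection
    return np.clip(x, lo, hi)
```

Both maps cost O(n) per evaluation, which is what makes the proximal step in the algorithm cheap.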
One method for solving a weakly convex stochastic optimization problem is given by repeated iterations of

x_{k+1} = argmin_y { f_{x_k}(y; S_k) + r(y) + (1/(2α_k))‖y − x_k‖² },   (3)

where α_k > 0 is a stepsize sequence, typically taken to satisfy α_k → 0, and f_{x_k}(y; S_k) approximates f at x_k using a noisy estimate S_k of the data. A basic stochastic subgradient method will use the linear model f_{x_k}(y; S_k) = f(x_k) + ⟨g_k, y − x_k⟩, where g_k ≈ ḡ ∈ ∂f(x_k). When using this approach, it is common to consider the existence of some oracle of an unbiased estimate of an element of the subgradient that enables one to build up the approximation f_{x_k} with favorable properties (see e.g., [2] or [4]). In our case we assume such an oracle is not available, and we only get access, at a point x, to a noisy function value observation F(x, ξ). Stochastic problems with only functional information available often arise in optimization, machine learning and statistics. A classic example is simulation based optimization (see e.g., [5,6] and references therein), where function evaluations usually represent the experimentally obtained behavior of a system and in practice are given by means of specific simulation tools, hence no internal or analytical knowledge of the functions is provided. Furthermore, evaluating the function at a given point is in many cases a computationally expensive task, and only a limited budget of evaluations is available in the end. Recently, suitable derivative free/zeroth order optimization methods have been proposed for handling stochastic functions (see e.g., [7][8][9][10]). For a complete overview of stochastic derivative free/zeroth order methods, we refer the interested reader to the recent review [6].
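For the linear model, the model-based update (3) admits a closed form: minimizing the model plus the quadratic reduces to a proximal step on r from x_k − α_k g_k. A minimal sketch, with our own naming and prox_r supplied by the user:

```python
import numpy as np

def model_step(x, g, alpha, prox_r):
    # argmin_y { f(x) + <g, y - x> + r(y) + (1/(2*alpha))*||y - x||^2 }
    #   = prox_{alpha * r}(x - alpha * g)
    return prox_r(x - alpha * g, alpha)

# with r = 0 the prox is the identity and this is a plain SGD step
x_next = model_step(np.array([1.0, -2.0]), np.array([0.5, 0.5]), 0.1,
                    lambda v, a: v)
```

This identity is why the scheme below only ever needs prox evaluations of r, never of f.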
Weakly convex functions show up in the modeling of many different statistical learning applications such as (robust) phase retrieval, sparse dictionary learning, and conditional value at risk minimization (see [2] for a complete description of these problems). Other interesting applications include the training of neural networks with Exponentiated Linear Units (ELUs) activation functions [11] and machine learning problems with L-smooth loss functions (see e.g., [12] and references therein).
In all these problems there might be cases where we only get access, at a point x, to an unbiased estimate F(x, ξ) of the loss function, and we thus need to resort to a stochastic derivative free/zeroth order approach in order to handle our problem. Recalling that a standard setting is one wherein a function evaluation is the noisy output of some complex simulation, such a problem can appear either in an inverse problem, where we are interested in using a robust nonsmooth loss function to match parameters to a nonconvex simulation against a set of observations over samples {ξ_i} of the simulation run (a loss of the composite form h(c(x)) described above), or for a simulation function that is convex but for which we are interested in, e.g., minimizing its conditional value at risk. At the time of writing, zeroth order or derivative free optimization for weakly convex problems has not been investigated. There are a number of works for stochastic nonconvex zeroth order optimization (e.g., [13]) and nonsmooth convex derivative free optimization (e.g., [9]).
In the case of stochastic weakly convex optimization but with access to a noisy element of the subgradient, there are a few works that have appeared fairly recently. Asymptotic convergence was shown in [4], which proves convergence with probability one for the method given in (3). Non-asymptotic convergence, as in convergence rates in expectation, is given in the two papers [2] and [14].
In this paper, we follow the approach proposed in [9] to handle nonsmoothness in our problem. We consider a smoothed version of the objective function, and we then apply a two point strategy to estimate its gradient. This tool is thus embedded in a proximal algorithm similar to the one described in [2] and enables us to get convergence at a similar rate as the original method (although with larger constants).
The rest of the paper is organized as follows. In Sect. 2 we describe the algorithm and provide some preliminary lemmas needed for the subsequent analysis. Section 3 contains the convergence proof. In Section 4 we show some numerical results on two standard test cases. Finally we conclude in Sect. 5.

Two point estimate and algorithmic scheme
We use the two point estimate presented in [9] to generate an approximation to an element of the subdifferential. In particular, consider the randomized smoothing of the function f,

f_{u_1}(x) = 𝔼_{z∼N(0,I_n)}[f(x + u_1 z)],

where the expectation is taken over a standard normal vector z ∼ N(0, I_n).
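The smoothed value f_{u_1}(x) can be approximated empirically by Monte Carlo. The sketch below (our own helper, not part of the algorithm) checks this on f(t) = |t| in one dimension, where the smoothed value at zero is u·𝔼|z| = u·√(2/π).

```python
import numpy as np

def smoothed_value(f, x, u, n_samples=20000, seed=0):
    # Monte Carlo estimate of f_u(x) = E_{z ~ N(0, I_n)}[ f(x + u*z) ]
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_samples, x.size))
    return np.mean([f(x + u * zi) for zi in z])

# f(t) = |t| in 1-D: the smoothed value at 0 should be close to u*sqrt(2/pi)
est = smoothed_value(lambda v: abs(v[0]), np.zeros(1), u=0.1)
```

The smoothing makes the nonsmooth |t| differentiable at the kink, at the cost of a bias of order u.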
The two point estimate we use is given by considering a second smoothing, now of f_{u_{1,t}} for a given u_{1,t} indexed by iteration t, i.e.,

f_{u_{1,t},u_{2,t}}(x) = 𝔼_{z∼N(0,I_n)}[f_{u_{1,t}}(x + u_{2,t} z)].

To derive the specific step computed, one considers the derivative of this function with respect to x, using the fact that the function v e^{−‖v‖²/2} is even, so the integrals of the odd terms vanish. Let {u_{1,t}}_{t=1}^∞, {u_{2,t}}_{t=1}^∞ be two nonincreasing sequences of positive parameters such that u_{2,t} ≤ u_{1,t}/2, let x_t be the given point, ξ_t a sample of the stochastic oracle Ξ, and Z_{1,t} ∼ μ_1 and Z_{2,t} ∼ μ_2 two independently sampled standard normal vectors. Then

g_t(x) = [F(x + u_{1,t} Z_{1,t} + u_{2,t} Z_{2,t}; ξ_t) − F(x + u_{1,t} Z_{1,t}; ξ_t)] / u_{2,t} ⋅ Z_{2,t}

is an unbiased estimator of ∇f_{u_{1,t},u_{2,t}}(x). Thus, effectively, the first random perturbation u_{1,t} Z_{1,t} smooths out the nonsmooth function F, and the second u_{2,t} Z_{2,t} obtains a zeroth order estimate, using noisy function computations, of its derivative. We shall use g_t(x) specifically in our algorithm at each iteration. We highlight the importance of using an adequate random number generator to compute Z_{1,t}, Z_{2,t} and the stochastic function realization ξ_t at every iteration. The samples used for ξ_t and Z_{1,t} are the same in F(x + u_{1,t} Z_{1,t} + u_{2,t} Z_{2,t}; ξ_t) and F(x + u_{1,t} Z_{1,t}; ξ_t), making the two point estimator essentially a common random number device.
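The estimator g_t can be coded directly from the formula above, reusing the same sample ξ_t and the same Z_1 in both evaluations (common random numbers). A sketch, checked on the smooth deterministic test function F(x) = ½‖x‖², where the estimates should average to the gradient x:

```python
import numpy as np

def two_point_estimate(F, x, u1, u2, xi, rng):
    # g_t(x) = (F(x + u1*Z1 + u2*Z2; xi) - F(x + u1*Z1; xi)) / u2 * Z2,
    # with the same xi and Z1 in both evaluations (common random numbers)
    z1 = rng.standard_normal(x.size)
    z2 = rng.standard_normal(x.size)
    diff = F(x + u1 * z1 + u2 * z2, xi) - F(x + u1 * z1, xi)
    return (diff / u2) * z2

rng = np.random.default_rng(0)
F = lambda v, xi: 0.5 * np.dot(v, v)   # deterministic smooth test case
x = np.array([1.0, 0.0])
g_avg = np.mean([two_point_estimate(F, x, 0.01, 0.005, None, rng)
                 for _ in range(40000)], axis=0)
```

Without common random numbers, the difference in the numerator would be dominated by the noise of two independent draws of ξ, and dividing by the small u_{2,t} would blow the variance up.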
We now report some results that provide theoretical guarantees on the error in the estimate. These results appear in [15], however we include some of their (short) proofs for completeness.
where we have used the Lipschitz constant L 0 for f as given in Assumption 1 and the last inequality follows from Eq. (6) in Lemma 1.

Proof
The condition proved in Lemma 3 is equivalent to the following inequality (see e.g., [15]).

Proof First, note the identity displayed above. In the chain that follows, the first inequality uses basic properties of integrals, the second inequality uses equation (9) from Lemma 3, and the last inequality uses equation (7) in Lemma 1.
We further report one more useful preliminary result.

Lemma 5 The following inequality holds:
Proof By using the definition of f_{u_{1,t}}(x), we obtain the identity above. After a proper rewriting, we use (2) to get a lower bound on the considered term; here e_x denotes a vector with n components, exactly one of which equals one. The last inequality is obtained from the Lipschitz property of f (Assumption 1).
We make the following assumption on f. The following lemma uses the previous results to characterize an important condition on the error of the estimate.
where Ĉ depends on M, L(P) and n but is independent of x.
Proof Define f̃(x) = f(x) + (ρ/2)‖x‖² for ‖x‖ ≤ M, with a continuous, linearly growing extension otherwise (e.g., for any x with ‖x‖ > M, take a greatest norm subgradient g(x) at Mx/‖x‖ and extend linearly in that direction). Note that by this construction and the assumptions, f̃ is Lipschitz and convex on ℝⁿ. Denote by h_{u_{1,t}}(x) its smoothed function and consider its two point gradient approximation. We can then split the error into the two terms on the right hand side. Since f_{u_{1,t}} and h_{u_{1,t}} are both Lipschitz and convex, we now directly apply [9, Lemma 2] to both errors on the right hand side to obtain the final result.
Note that the last lemma combined with the previous results implies a tighter bound on ‖∇f_{u_{1,t}}(x)‖². To get the first inequality in the corresponding chain, we use basic properties of the expectation and the inequality (a + b + c)² ≤ 3a² + 3b² + 3c². We then use Lemma 4 to upper bound the first term in the summation and suitably rewrite the second one, thus getting the right-hand side of the second inequality. The third inequality is finally obtained by taking into account the unbiasedness of g_t(x) (i.e., 𝔼[g_t(x)] = ∇f_{u_{1,t},u_{2,t}}(x)) and Lemma 6.
The algorithmic scheme used in the paper is reported in Algorithm 1. At each iteration t we simply build a two point estimate g_t of the gradient of the smoothed function and then apply a proximal map to the point x_t − α_t g_t, with α_t > 0 a suitably chosen stepsize.
We let α_t be a diminishing stepsize and set

We thus have in our scheme a derivative free version of Algorithm 3.1 reported in [2].
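Putting the pieces together, a minimal sketch of the scheme in NumPy (our own function signature, not the paper's: alpha, u1, u2 are callables t ↦ parameter value, and prox_r(v, a) evaluates prox_{a·r}(v)):

```python
import numpy as np

def zo_proximal(F, prox_r, x0, n_iters, alpha, u1, u2, seed=0):
    # Derivative-free proximal scheme: at each iteration build the two point
    # estimate g_t, then take x_{t+1} = prox_{alpha_t * r}(x_t - alpha_t * g_t).
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for t in range(n_iters):
        z1 = rng.standard_normal(x.size)
        z2 = rng.standard_normal(x.size)
        xi = rng.integers(10**6)            # stand-in for a sample from Xi
        a, v1, v2 = alpha(t), u1(t), u2(t)
        diff = F(x + v1 * z1 + v2 * z2, xi) - F(x + v1 * z1, xi)
        g = (diff / v2) * z2                # two point gradient estimate
        x = prox_r(x - a * g, a)            # proximal step on r
    return x

# sanity run on f = 0.5*||x||^2 with r = 0 (prox is the identity)
x_final = zo_proximal(lambda v, xi: 0.5 * np.dot(v, v),
                      lambda v, a: v,
                      x0=[5.0, 5.0], n_iters=2000,
                      alpha=lambda t: 0.01,
                      u1=lambda t: 0.01, u2=lambda t: 0.005)
```

The choice u2(t) ≤ u1(t)/2 mirrors the parameter condition stated for the estimator.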

Convergence of the derivative free algorithm
We now analyze the convergence properties of Algorithm 1. We follow [2, Sect. 3.2] in the proof of our results. We consider a value ρ̄ > ρ, and assume α_t < min{1, (ρ̄ − ρ)/2} for all t.
We first define the function φ(x) = f(x) + r(x) and introduce the Moreau envelope function

φ_{1/ρ̄}(x) = min_y { φ(y) + (ρ̄/2)‖y − x‖² },

with the proximal map

prox_{φ/ρ̄}(x) = argmin_y { φ(y) + (ρ̄/2)‖y − x‖² }.

We use the corresponding definition of ∇φ_{1/ρ̄}(x) as well in the convergence theory,

∇φ_{1/ρ̄}(x) = ρ̄(x − prox_{φ/ρ̄}(x)).

Some of the steps follow along the same lines given in [2, Lemma 3.5], owing to the smoothness of f_{u_{1,t}}(x).
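The gradient of the Moreau envelope, used here as the stationarity measure, can be evaluated directly from the proximal map. A one-dimensional sanity check with φ(x) = |x|, whose prox is soft-thresholding (the helper names are ours):

```python
import numpy as np

def moreau_grad(prox_phi, x, rho_bar):
    # gradient of the Moreau envelope phi_{1/rho_bar} at x:
    #   rho_bar * (x - prox_{phi / rho_bar}(x))
    return rho_bar * (x - prox_phi(x, 1.0 / rho_bar))

# phi = |.| in 1-D: prox_{lam*|.|}(x) is soft-thresholding
soft = lambda x, lam: np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

g_far = moreau_grad(soft, 2.0, rho_bar=1.0)   # away from 0: envelope slope is 1
g_near = moreau_grad(soft, 0.5, rho_bar=1.0)  # near 0: slope is rho_bar * x
```

The envelope of |x| is the Huber function, so the gradient saturates at ±1 away from the kink and is linear near it, which matches the two values computed.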
We derive the following recursion lemma, which establishes an important descent property for the iterates. We denote by 𝔼_t the conditional expectation with respect to the σ-algebra of random events up to iteration t; that is, all of Z_{1,s}, Z_{2,s} and ξ_s are given for s < t, while for s ≥ t they are random variables. In order to derive this lemma, we require an additional assumption that is reasonable in this setting.

Assumption 3
The sequence {x_t} generated by the algorithm is bounded (i.e., there exists an M > 0 such that ‖x_t‖ ≤ M for all t).
Note that this assumption is satisfied if, for instance, r(⋅) = Σ_{j=1}^J r_j(⋅) and, for at least one j ∈ {1, …, J}, r_j(⋅) is the indicator of a compact set X (i.e., r_j(x) = 0 if x ∈ X and r_j(x) = ∞ otherwise).

Lemma 7 Then it holds that there exists a B, independent of t, such that
Proof First we see that x_t can be obtained as a proximal point of r; the last equivalence follows from the optimality conditions of the proximal subproblem. Letting γ_t = 1 − α_t ρ̄, we get the next bound, where the inequality follows from the nonexpansiveness property of the proximal map prox_{α_t r}(x). We can thus write the following chain of equalities, with the first equality obtained by rearranging the terms inside the norm, the second by adding and subtracting α_t ∇f_{u_{1,t}}(x_t) to those terms, and the third by taking into account the definition of the Euclidean norm and basic properties of the expectation. In the next display, the first equality is obtained by explicitly taking the expectation with respect to ξ_t, while the unbiasedness of g_t (i.e., 𝔼[g_t] = ∇f_{u_{1,t},u_{2,t}}(x_t)) gives the second. We now upper bound the terms in the summation. We first split the last term of the previous display using (a + b)² ≤ 2a² + 2b² and basic properties of the expectation. The first inequality is obtained by Cauchy–Schwarz and by suitably rewriting the third term in the summation. We then use the inequality 2a⋅b ≤ a² + b² combined with Lemma 4 (or equation (10)) to bound the resulting second term in the summation, input equation (13) to obtain the explicit constant and the dependence on t, use Lemma 6 to upper bound the third term, and finally apply the unbiasedness of g_t, thus getting the next inequality. Hence we arrive at the next display, where the equality is given by rearranging the terms in the summation and taking into account the definition of the Euclidean norm, and the inequality is obtained by upper bounding the scalar product by means of Lemma 5 and the third term in the summation by combining the triangle inequality, the Cauchy–Schwarz inequality, and (12). Continuing, the first and last equalities are simply obtained by rearranging the terms in the summation.

The inequality is obtained by upper bounding the third term in the summation using 2a⋅b ≤ a² + b². Finally, recalling the definition of B, the second to last inequality is obtained by simply considering the expression of α_t in Eq. (14). We combine the sequence of inequalities shown in the lemma to obtain the result.
After proving Lemma 7, we can now state the main convergence result for Algorithm 1.

Theorem 1 The sequence generated by Algorithm 1 satisfies inequality (15), and thus (17).
In particular, if we define α_t as in (16), then the first inequality in the chain below comes from the definition of the proximal map, the second from the result proved in Lemma 7, and the third from Lemma 2. Continuing, the first inequality is obtained by Lemma 2. Indeed, let us now denote by x̂_t the minimizer of φ(x) + (ρ̄/2)‖x − x_t‖², and recall the characterization of x_t as a proximal point given at the beginning of the proof of Lemma 7. The second inequality is obtained by using the definition of u_{1,t} in (13). The last equality is due to basic properties of the Moreau envelope and the definition of x̂_t. Now, we take full expectations and obtain:

The rest of the proof is as in [2, Theorem 3.4]. In particular, summing the recursion and using the lower boundedness of f + r in Assumption 1, we can finally state the bound. Since the left-hand side is by definition 𝔼[‖∇φ_{1/ρ̄}(x_{t*})‖²], we get inequality (15). Furthermore, by plugging the expression of α_t given in (16) into (15), we get the final inequality (17).
Theorem 1 gives an overall bound on the weighted expected norm of the gradient of the Moreau envelope as the statistical measure of distance to convergence with respect to the number of iterations. The worst case bound is weighted by the possible range of the function the algorithm must traverse, i.e., from the starting value to the global minimum, as well as the error in the iterates in traversing this range due to the inaccuracy in the zeroth order function and noisy subgradient approximations. The order of convergence is the same as the one reported in [2]; however, the constant is larger, given the additional error in the quality of the steps. Note that since the convergence result is stated in a similar formalism, using the gradient of the Moreau envelope, we can interpret this approximate stationarity concept as in [2, pp. 3-4]: a small value of ‖∇φ_{1/ρ̄}(x)‖ implies that x is near some point x̂ (specifically, ‖x − x̂‖ = (1/ρ̄)‖∇φ_{1/ρ̄}(x)‖) satisfying a bound on the distance to stationarity, dist(0, ∂φ(x̂)) ≤ ‖∇φ_{1/ρ̄}(x)‖. In this case an additional level of approximation to stationarity is added, as we are taking the gradient of the smoothed proximal function, which is itself a perturbation of the original function.

Numerical results
In this section, we investigate the numerical performance of Algorithm 1 on a set of standard weakly convex optimization problems defined in [2]. In particular, we consider phase retrieval, which seeks to minimize the function

f(x) = (1/m) Σ_{i=1}^m |⟨a_i, x⟩² − b_i|,   (18)

and blind deconvolution, which seeks to minimize

f(x, y) = (1/m) Σ_{i=1}^m |⟨u_i, x⟩⟨v_i, y⟩ − b_i|.   (19)

Both of these applications are ones in which Common Random Numbers (CRNs) are a reasonable assumption, making two point gradient estimates relevant. In particular, in (18) the pairs (a_i, b_i) can be held constant between two function evaluations, and in (19) the triplets (u_i, v_i, b_i) can be fixed as well.
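For reference, the two objectives can be written down in a few lines; the sketch below assumes the standard finite-sum forms from [2] and passes the data defining one evaluation explicitly, so that both evaluations inside a two point estimate can reuse it (the CRN assumption).

```python
import numpy as np

def phase_retrieval_loss(x, A, b):
    # (1/m) * sum_i | <a_i, x>^2 - b_i |, with a_i the rows of A
    return np.mean(np.abs((A @ x) ** 2 - b))

def blind_deconv_loss(x, y, U, V, b):
    # (1/m) * sum_i | <u_i, x> * <v_i, y> - b_i |
    return np.mean(np.abs((U @ x) * (V @ y) - b))
```

Holding (A, b) (respectively (U, V, b)) fixed between the two function calls of one gradient estimate is exactly what makes the estimator a common random number device here.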

Comparison with methods using a stochastic subgradient oracle
We first compare Algorithm 1 with the stochastic subgradient method and the stochastic proximal method of [2]. The goal of this set of experiments is to understand whether our approach is competitive with methods that use a stochastic subgradient oracle, and how the practical behavior of the method fits the theoretical analysis. We generate random Gaussian measurements in N(0, I_{d×d}) and a target signal x̄ drawn uniformly on the unit sphere to compute the b_i, with dimensions (d, m) = (10, 30), (20, 60), (40, 120). We use fixed stepsizes α_t in the range [10⁻⁶, 10⁻¹]. We generate ten runs of each algorithm for every dimension and stepsize, and pick the best one according to the final objective value. The total number of iterations used in all cases is 100,000.
We show the gap of the different methods when varying the stepsize for both phase retrieval (Fig. 1) and blind deconvolution (Fig. 2).
It is interesting that the zeroth order algorithm performs on par with those that use the stochastic subgradient oracle. In particular, our method is more robust to the choice of the stepsize than the stochastic subgradient method, and it is competitive with the proximal method. In Figs. 3 and 4, we report the evolution of the objective values obtained with stepsize 10⁻⁴ for the instances (d, m) = (10, 30). These are nice examples of how well the zeroth order algorithm works when the stepsize is properly chosen.

Comparison with a naive stochastic variant of NOMAD
Now, in order to understand whether Algorithm 1 is competitive with other (stochastic) nonsmooth methods from the DFO literature, we report here a preliminary comparison with a naive stochastic variant of NOMAD [16,17]. More specifically, we consider a mesh adaptive direct search (MADS) algorithm that uses a unit-size sample for each function evaluation. We note that the MADS algorithm was originally developed for deterministic blackbox optimization; recently, a stochastic variant of this approach was proposed in [18].
In our tests, Algorithm 1 compares favorably with the naive version of NOMAD for all precisions. We notice that, when the required precision is 10⁻³, NOMAD does not appear in the plots; hence it never satisfies condition (20) for this precision. We further report, in Figs. 7 and 8, the box plots of the function gaps obtained by the algorithms over the 100 instances considered in the tests. Those plots show that our algorithm gets very close to the optimal value for suitable choices of the stepsize.

Conclusion
In this paper we studied, for the first time, minimization of a stochastic weakly convex function without the presence of an oracle providing a noisy estimate of the subgradient of the function, i.e., in the context of derivative free or zeroth order optimization. We were able to derive theoretical convergence rate results on par with the standard methods for stochastic weakly convex optimization, and demonstrated the algorithm's efficacy on a couple of standard test cases. In expanding the scope of zeroth order optimization, we hope that this work highlights the potential of derivative free methods in general, and of the two point smoothed function approximation technique in particular, for an increasingly wider class of problems.
Acknowledgements The authors are indebted to the referees for their constructive suggestions which helped to improve on the earlier version of this article.