Proximal gradient methods with inexact oracle of degree q for composite optimization

We introduce the concept of inexact first-order oracle of degree q for a possibly nonconvex and nonsmooth function, which naturally appears in the context of approximate gradient, weak level of smoothness and other situations. Our definition is less conservative than those found in the existing literature, and it can be viewed as an interpolation between fully exact and the existing inexact first-order oracle definitions. We analyze the convergence behavior of a (fast) inexact proximal gradient method using such an oracle for solving (non)convex composite minimization problems. We derive complexity estimates and study the dependence between the accuracy of the oracle and the desired accuracy of the gradient or of the objective function. Our results show that better rates can be obtained both theoretically and in numerical simulations when q is large.


Introduction
Optimization methods based on gradient information are widely used in applications where high accuracy is not required, such as machine learning, data analysis, signal processing and statistics [2,4,16,20]. The standard convergence analysis of gradient-based methods requires the availability of exact gradient information for the objective function. However, in many optimization problems one does not have access to exact gradients, e.g., when the gradient is itself obtained by solving another optimization problem. In this case one can use inexact (approximate) gradient information. In this paper, we consider the following composite optimization problem:

min_{x ∈ E} f(x) := F(x) + h(x),   (1)

where h : E → R̄ is a simple (i.e., proximal easy) closed convex function, F : E → R̄ is a general lower semicontinuous function (possibly nonconvex), and there exists f_∞ such that f(x) ≥ f_∞ > −∞ for all x ∈ dom f = dom h.
We assume that we can compute exactly the proximal operator of h, and that we cannot access the (sub)differential of F, but can compute an approximation of it at any given point. Optimization algorithms with inexact first-order oracles are well studied in the literature, see e.g. [3,5-9,18]. For example, [7] considers the case where h is the indicator function of a convex set Q and F is a convex function, and introduces the so-called inexact first-order (δ, L)-oracle for F, i.e., for any y ∈ Q one can compute an inexact oracle consisting of a pair (F_{δ,L}(y), g_{δ,L}(y)) such that:

0 ≤ F(x) − F_{δ,L}(y) − ⟨g_{δ,L}(y), x − y⟩ ≤ (L/2)‖x − y‖² + δ for all x ∈ Q.   (2)

Then, [7] introduces (fast) inexact first-order methods based on the information g_{δ,L}(y) and derives asymptotic convergence in function values of order O(1/k + δ) or O(1/k² + kδ), respectively. One can notice that in the nonaccelerated scheme the objective function accuracy decreases with k and asymptotically tends to δ, while in the accelerated scheme there is error accumulation. Further, [9] considers problem (1) with the domain of h bounded, and introduces a one-sided variant of this inexact first-order oracle suited to nonconvex F. Under the assumptions that F is nonconvex and h is convex but with bounded domain, [9] derives a sublinear rate in the squared norm of the generalized gradient mapping of order O(1/k + δ) for an inexact proximal gradient method based on the information g_{δ,L}(y). Note that all previous results provide convergence rates under the assumption of boundedness of the domain of f (or, equivalently, of h). An open question is whether one can modify the previous definitions of inexact first-order oracle to cover both the convex and nonconvex settings, in order to be more general and to improve the convergence results of an algorithm based on this inexact information. More precisely, can one define a general inexact oracle that bridges the gap between the exact oracle (exact gradient information) and the existing inexact first-order oracle definitions found in the literature [7,9]? In this paper we
answer this question positively for both convex and nonconvex problems, introducing a suitable definition of inexactness for a first-order oracle for F involving a degree 0 ≤ q < 2, which consists in multiplying the constant δ in (2) by the quantity ‖x − y‖^q (see Definition 2). We provide several examples that fit our proposed inexact first-order oracle framework, such as approximate gradients or weak levels of smoothness, and show that, under this new definition of inexactness, we can remove the boundedness assumption on the domain of h. Then, we consider an inexact proximal gradient algorithm based on this inexact first-order oracle and provide convergence rates of order O(1/k + δ^{2/(2−q)}) for q ∈ [0, 1), together with a corresponding estimate for q ∈ [1, 2), for nonconvex composite problems, and of order O(1/k + δ/k^{q/2}) for q ∈ [0, 2) for convex composite problems of the form (1). We also derive convergence rates of order O(1/k² + δk^{1−3q/2}) for a fast inexact proximal gradient algorithm for solving the convex composite problem (1). Note that our convergence rates improve as q increases. In particular, for the inexact proximal gradient algorithm the power of δ in the convergence estimate is higher for q ∈ (0, 1) than for q = 0, while for q ≥ 1 the coefficients of δ diminish with k. For the fast inexact proximal gradient method we show that there is no error accumulation for q ≥ 2/3. Hence, it is beneficial to consider an inexact first-order oracle of degree q > 0, as this allows us to work with less accurate approximations of the (sub)gradient of F when q is large.

Notations and preliminaries
In what follows, Rⁿ denotes the finite-dimensional Euclidean space endowed with the standard inner product ⟨s, x⟩ = sᵀx and the corresponding norm ‖s‖ = ⟨s, s⟩^{1/2} for any s ∈ Rⁿ. For a proper lower semicontinuous convex function h, we denote its domain by dom h = {x ∈ Rⁿ : h(x) < ∞} and its proximal operator by:

prox_{γh}(x) := argmin_u { h(u) + (1/(2γ))‖u − x‖² }.

Next, we provide a few definitions and properties of subdifferential calculus in the nonconvex setting (see [13,17] for more details).
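To make the prox notation concrete, here is a small numerical sketch (an illustration of ours, not part of the analysis): for h = ‖·‖₁, the proximal operator reduces to the classical soft-thresholding map, which we check against the defining minimization on a one-dimensional grid.

```python
import numpy as np

def prox_l1(x, gamma):
    # Proximal operator of gamma*||.||_1, i.e.
    # argmin_u { gamma*||u||_1 + 0.5*||u - x||^2 } = soft-thresholding.
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

# Brute-force check of the defining minimization for a 1-D instance
# (the point x = 1.3 and gamma = 0.5 are illustrative choices).
x, gamma = 1.3, 0.5
u_grid = np.linspace(-3, 3, 200001)
obj = gamma * np.abs(u_grid) + 0.5 * (u_grid - x) ** 2
u_star = u_grid[np.argmin(obj)]
assert abs(u_star - prox_l1(np.array([x]), gamma)[0]) < 1e-3
```

The closed form explains why h is "proximal easy" here: the minimizer separates coordinate-wise.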
Definition 1 (Subdifferential) Let f : Rⁿ → R̄ be a proper lower semicontinuous function. For a given x ∈ dom f, the Fréchet subdifferential of f at x, written ∂̂f(x), is the set of all vectors g_x ∈ Rⁿ satisfying:

lim inf_{y → x, y ≠ x} (f(y) − f(x) − ⟨g_x, y − x⟩)/‖y − x‖ ≥ 0.

When x ∉ dom f, we set ∂̂f(x) = ∅. The limiting subdifferential, or simply the subdifferential, of f at x ∈ dom f, written ∂f(x), is defined as [13]:

∂f(x) := {g_x ∈ Rⁿ : ∃ x_k → x with f(x_k) → f(x) and g_k ∈ ∂̂f(x_k) with g_k → g_x}.

Note that ∂̂f(x) ⊆ ∂f(x) for each x ∈ dom f. For f(x) = F(x) + h(x), if F and h are regular at x ∈ dom f, then ∂f(x) = ∂F(x) + ∂h(x) (see Theorem 6 in [11] for more details). Further, if f is proper, lower semicontinuous and convex, then [17]:

∂f(x) = {g_x ∈ Rⁿ : f(y) ≥ f(x) + ⟨g_x, y − x⟩ for all y ∈ Rⁿ}.

A function F : Rⁿ → R is L_F-smooth if it is differentiable and its gradient is L_F-Lipschitz, i.e.:

‖∇F(x) − ∇F(y)‖ ≤ L_F‖x − y‖ for all x, y.

It follows immediately that [14]:

|F(y) − F(x) − ⟨∇F(x), y − x⟩| ≤ (L_F/2)‖y − x‖² for all x, y.

Finally, let us recall the classical weighted arithmetic-geometric mean inequality: if a, b are positive constants and 0 ≤ α₁, α₂ ≤ 1 with α₁ + α₂ = 1, then a^{α₁} b^{α₂} ≤ α₁a + α₂b. We will later use the following consequence: for any δ, ρ > 0 and 0 ≤ q < 2,

δ‖x − y‖^q ≤ (qρ/2)‖x − y‖² + ((2 − q)/2) δ^{2/(2−q)}/ρ^{q/(2−q)}.   (4)

Inexact first-order oracle of degree q

In this section, we introduce our new inexact first-order oracle of degree 0 ≤ q < 2 and provide some nontrivial examples that fit into our framework. Our oracle can deal with general functions (possibly with unbounded domain), unlike the previous results in [7,9], but requires exact zero-order information.
Definition 2 The function F is equipped with an inexact first-order (δ, L)-oracle of degree q ∈ [0, 2) if for any y ∈ dom f one can compute g_{δ,L,q}(y) ∈ E* such that:

F(x) − F(y) − ⟨g_{δ,L,q}(y), x − y⟩ ≤ (L/2)‖x − y‖² + δ‖x − y‖^q for all x ∈ dom f.   (5)

To the best of our knowledge, this definition of a first-order inexact oracle is new. The motivation behind it is to introduce a versatile inexact first-order oracle framework that bridges the gap between the exact oracle (exact gradient information, i.e., q = 2) and the existing inexact first-order oracle definitions found in the literature (i.e., q = 0). More specifically, when q = 2, Definition 2 aligns with established results for smooth functions under exact gradient information, while when q = 0, our definition has been previously explored in the literature, see [7,9]. Next, we provide several examples that satisfy Definition 2 naturally, and then we provide theoretical results showing the advantages of this new inexact oracle over the existing ones from the literature.
Example 1 (Smooth function with inexact first-order oracle) Let F be differentiable with gradient Lipschitz continuous with constant L_F over dom f. Assume that for any x ∈ dom f one can compute g_{∆,L_F}(x), an approximation of the gradient ∇F(x), satisfying:

‖g_{∆,L_F}(x) − ∇F(x)‖ ≤ ∆.   (6)

Then F is equipped with a (δ, L)-oracle of degree q = 1 as in Definition 2, with δ = ∆, L = L_F and g_{δ,L,1}(x) = g_{∆,L_F}(x). Indeed, since F is L_F-smooth, we get:

F(y) ≤ F(x) + ⟨∇F(x), y − x⟩ + (L_F/2)‖y − x‖².

It follows that:

F(y) − F(x) − ⟨g_{∆,L_F}(x), y − x⟩ ≤ ⟨∇F(x) − g_{∆,L_F}(x), y − x⟩ + (L_F/2)‖y − x‖² ≤ (L_F/2)‖y − x‖² + ∆‖y − x‖,

which completes our statement. Finite-sum optimization problems appear widely in machine learning [4] and deal with an objective F(x) := Σ_{i=1}^N F_i(x), where N is possibly large. In the stochastic setting, we sample stochastic derivatives at each iteration in order to form a mini-batch approximation of the gradient of F. If we define the mini-batch gradient estimate over a subset S of {1, …, N}, then condition (6) holds with high probability (see Lemma 11 in [1]).
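The oracle inequality in this example can be verified numerically. The following sketch (our illustration: a quadratic F with PSD Hessian A, a fixed noise level ∆, and random point pairs are all assumptions, not the paper's setting) checks the degree-1 bound (5) with δ = ∆ and L = L_F.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
A = rng.standard_normal((n, n))
A = A.T @ A                           # PSD Hessian, so F is convex and smooth
L_F = np.linalg.eigvalsh(A).max()     # Lipschitz constant of grad F

def F(x): return 0.5 * x @ A @ x
def grad_F(x): return A @ x

Delta = 0.1
for _ in range(1000):
    x, y = rng.standard_normal(n), rng.standard_normal(n)
    noise = rng.standard_normal(n)
    noise *= Delta / np.linalg.norm(noise)   # ||g - grad F(x)|| = Delta exactly
    g = grad_F(x) + noise                    # inexact oracle of Example 1
    lhs = abs(F(y) - F(x) - g @ (y - x))
    r = np.linalg.norm(y - x)
    rhs = 0.5 * L_F * r**2 + Delta * r       # (Delta, L_F)-oracle, degree q = 1
    assert lhs <= rhs + 1e-9
```

The check passes deterministically: the quadratic residual is at most (L_F/2)r² and the noise contributes at most ∆r by Cauchy-Schwarz.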
Remark 1 This example has also been considered in [7,9]. However, in those papers δ depends on the diameter of the domain of f, which is assumed to be bounded. Our inexact oracle is more general and does not require boundedness of the domain of f: in our case δ = ∆, while in [7,9], δ = 2∆D, where D is the diameter of the domain of f. Hence, our definition is more natural in this setting.
Example 2 (Computations at shifted points) Let F be differentiable with Lipschitz continuous gradient with constant L_F over dom f. For any x ∈ dom f, assume we can compute the exact value of the gradient, albeit evaluated at a shifted point x̄, different from x and satisfying ‖x − x̄‖ ≤ ∆. Then F is equipped with a (δ, L)-oracle of degree q = 1 as in Definition 2, with g_{δ,L,1}(x) = ∇F(x̄), L = L_F and δ = L_F∆. Indeed, since F is L_F-smooth, we have:

F(y) − F(x) − ⟨∇F(x̄), y − x⟩ ≤ (L_F/2)‖y − x‖² + ⟨∇F(x) − ∇F(x̄), y − x⟩ ≤ (L_F/2)‖y − x‖² + L_F∆‖y − x‖,

where the second inequality follows from the Cauchy-Schwarz inequality and the Lipschitz continuity of ∇F. This proves our statement.
Remark 2 This example was also considered in [7,9], with the corresponding (δ, L)-oracle having δ = L_F∆², L = 2L_F and q = 0. Note that our L in Definition 2 is half the corresponding L in [7,9].
Example 3 (Accuracy measures for approximate solutions) Let us consider an L_F-smooth F given by:

F(x) = max_{u ∈ U} ψ(x, u), with ψ(x, u) := ⟨Ax, u⟩ + G(u),

where A : E → E* is a linear operator and G(·) is a differentiable strongly concave function with concavity parameter κ > 0. Under these assumptions, the maximization problem max_{u∈U} ψ(x, u) has a unique optimal solution u*(x) for any given x. Moreover, F is convex and smooth with Lipschitz continuous gradient (with constant L_F = ‖A‖²/κ). Suppose that for any x ∈ dom f one can compute u_x, an approximate maximizer of ψ(x, ·), such that ‖u*(x) − u_x‖ ≤ ∆. Then F is equipped with a (δ, L)-oracle of degree q = 1 with δ = ∆‖A‖, L = L_F and g_{δ,L,1}(x) = Au_x. Indeed, since F has a Lipschitz continuous gradient, we have:

F(y) − F(x) − ⟨Au_x, y − x⟩ ≤ (L_F/2)‖y − x‖² + ⟨A(u*(x) − u_x), y − x⟩ ≤ (L_F/2)‖y − x‖² + ∆‖A‖‖y − x‖.

Hence, our statement follows.
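A small numerical sketch of this example (our illustration: the concrete choice G(u) = c⊤u − (κ/2)‖u‖², which makes u*(x) = (Ax + c)/κ available in closed form, is an assumption): perturbing the inner solution by at most ∆ perturbs the reported gradient by at most ‖A‖∆.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, kappa = 15, 10, 2.0
A = rng.standard_normal((m, n))
c = rng.standard_normal(m)

# psi(x, u) = <Ax, u> + c@u - 0.5*kappa*||u||^2 is strongly concave in u,
# so u*(x) = (A@x + c)/kappa and grad F(x) = A.T @ u*(x).
def u_star(x): return (A @ x + c) / kappa
def grad_F(x): return A.T @ u_star(x)

Delta = 0.05
op_norm_A = np.linalg.svd(A, compute_uv=False)[0]   # spectral norm ||A||
for _ in range(200):
    x = rng.standard_normal(n)
    pert = rng.standard_normal(m)
    pert *= Delta / np.linalg.norm(pert)
    u_x = u_star(x) + pert          # approximate inner maximizer, error <= Delta
    g = A.T @ u_x                   # inexact first-order oracle for F
    assert np.linalg.norm(grad_F(x) - g) <= op_norm_A * Delta + 1e-12
```

This is exactly the gradient-error bound ‖∇F(x) − Au_x‖ ≤ ‖A‖∆ used in the derivation above.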
Remark 3 This example was also considered in [7], with the corresponding (δ, L)-oracle having δ = ∆, L = 2L_F and q = 0, while in our case we have δ = ∆‖A‖, L = L_F and q = 1.
Example 4 (Weak level of smoothness) Let F be a proper lower semicontinuous function with the subdifferential ∂F(x) nonempty for all x ∈ dom f. Assume that F satisfies the following Hölder condition with H_ν < ∞:

‖g(x) − g(y)‖ ≤ H_ν‖x − y‖^ν   (8)

for all g(x) ∈ ∂F(x), g(y) ∈ ∂F(y), where x, y ∈ dom f and ν ∈ [0, 1]. Then F is equipped with a (δ, L)-oracle of degree q as in Definition 2, with g_{δ,L,q}(x) ∈ ∂F(x), for any degree 0 ≤ q < 1 + ν and any accuracy δ > 0, and a constant L depending on δ as specified below. Indeed, the Hölder condition yields [14]:

F(x) − F(y) − ⟨g(y), x − y⟩ ≤ (H_ν/(1 + ν))‖x − y‖^{1+ν}.

For any given δ > 0, we compute L(δ) such that the following inequality holds for all x, y:

(H_ν/(1 + ν))‖x − y‖^{1+ν} ≤ (L(δ)/2)‖x − y‖² + δ‖x − y‖^q.

Denote r = ‖x − y‖ and let λ = (1 + ν − q)/(2 − q) ∈ (0, 1]. Writing r^{1+ν} = (r²)^λ (r^q)^{1−λ} and using the weighted arithmetic-geometric mean inequality with α₁ = λ and α₂ = 1 − λ, for a given positive δ one may choose:

L(δ) = 2λ (H_ν/(1 + ν))^{1/λ} ((1 − λ)/δ)^{(1−λ)/λ},

and this proves our statement. Note that if ν > 0, then ∂F(x) = {∇F(x)} for all x, and thus F is differentiable. Indeed, applying (8) with y = x to two subgradients g(x), ḡ(x) ∈ ∂F(x) gives g(x) = ḡ(x); hence the set ∂F(x) has a single element, so F is differentiable. This example covers large classes of functions. Indeed, when ν = 1, we get functions with Lipschitz continuous gradient. For ν < 1, we get a weaker level of smoothness. In particular, when ν = 0, we obtain functions whose subgradients have bounded variation. Clearly, the latter class includes functions whose subgradients are uniformly bounded by M (just take H₀ = 2M). It also covers functions smoothed by local averaging or by Moreau-Yosida regularization (see [7] for more details). We believe that the reader may find other examples satisfying our Definition 2 of an inexact first-order oracle of degree q.
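The trade-off L(δ) above can be sanity-checked numerically. The following sketch (with illustrative values of H_ν, ν, q and δ) verifies the inequality (H_ν/(1+ν)) r^{1+ν} ≤ (L(δ)/2) r² + δ r^q over a grid of r, which holds by the weighted AM-GM argument.

```python
import numpy as np

# Illustrative constants: Hölder constant, exponent, oracle degree, accuracy.
H, nu, q, delta = 1.0, 0.5, 0.5, 0.1
lam = (1 + nu - q) / (2 - q)          # AM-GM weight, in (0, 1] when q < 1 + nu
L = 2 * lam * (H / (1 + nu)) ** (1 / lam) * ((1 - lam) / delta) ** ((1 - lam) / lam)

r = np.linspace(1e-6, 100, 100000)
lhs = H / (1 + nu) * r ** (1 + nu)    # Hölder upper model
rhs = 0.5 * L * r ** 2 + delta * r ** q
assert np.all(lhs <= rhs + 1e-9)      # degree-q oracle bound holds for all r
```

Decreasing δ inflates L(δ), mirroring the usual accuracy/smoothness trade-off for Hölder-smooth functions.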

Inexact proximal gradient method
In this section, we introduce an inexact proximal gradient method based on the previous inexact oracle definition for solving (non)convex composite minimization problems (1). We derive complexity estimates for this algorithm and study the dependence between the accuracy of the oracle and the desired accuracy of the gradient or of the objective function. Hence, we consider the following Inexact Proximal Gradient Method (I-PGM).

Algorithm 1 Inexact proximal gradient method (I-PGM)
1. Given x_0 ∈ dom h and 0 ≤ q < 2.
2. For k ≥ 0 do: choose δ_k, L_k and α_k; obtain g_{δ_k,L_k,q}(x_k); compute:

x_{k+1} = prox_{α_k h}(x_k − α_k g_{δ_k,L_k,q}(x_k)).

Note that Algorithm 1 is an inexact proximal gradient method, where the inexactness comes from the approximate computation of the (sub)gradient of F, denoted g_{δ_k,L_k,q}(x_k). In the next sections we analyze the convergence behavior of this algorithm when g_{δ_k,L_k,q}(x_k) satisfies Definition 2.
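The iteration above can be sketched in a few lines. The instance below is our own illustration (a strongly convex quadratic F, an ℓ₁ regularizer h, constant stepsize, and bounded gradient noise of norm ∆ are all assumptions), chosen so the exact solution is known in closed form.

```python
import numpy as np

def ipgm(grad_inexact, prox_h, x0, alpha, iters):
    # I-PGM sketch: x_{k+1} = prox_{alpha*h}(x_k - alpha * g_k),
    # where g_k is an inexact (sub)gradient of F at x_k.
    x = x0.copy()
    for k in range(iters):
        g = grad_inexact(x, k)
        x = prox_h(x - alpha * g, alpha)
    return x

rng = np.random.default_rng(0)
n, lam, Delta = 50, 0.1, 1e-3
b = rng.standard_normal(n)

def grad_inexact(x, k):
    # Gradient of F(x) = 0.5*||x - b||^2, corrupted by noise of norm Delta.
    noise = rng.standard_normal(n)
    return (x - b) + Delta * noise / np.linalg.norm(noise)

def prox_h(x, alpha):
    # Soft-thresholding = prox of alpha*lam*||.||_1.
    return np.sign(x) * np.maximum(np.abs(x) - alpha * lam, 0.0)

x = ipgm(grad_inexact, prox_h, np.zeros(n), alpha=1.0, iters=500)
# With alpha = 1/L_F = 1, the exact minimizer is prox_h(b, 1.0);
# nonexpansiveness of prox keeps the iterates within O(Delta) of it.
assert np.linalg.norm(x - prox_h(b, 1.0)) <= 10 * Delta + 1e-6
```

The final assertion illustrates the theme of this section: the attainable accuracy is governed by the oracle error ∆, not by the number of iterations alone.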

Nonconvex convergence analysis
In this section we consider a nonconvex function F that admits an inexact first-order (δ, L)-oracle of degree q as in Definition 2. Using this definition and inequality (4), for all ρ > 0 we get the following upper bound:

F(x) ≤ F(y) + ⟨g_{δ,L,q}(y), x − y⟩ + ((L + qρ)/2)‖x − y‖² + ((2 − q)/2) δ^{2/(2−q)}/ρ^{q/(2−q)}.   (9)

This inequality will play a key role in our convergence analysis. We define the gradient mapping at iteration k as g_{δ_k,L_k,q}(x_k) + p_{k+1}, where p_{k+1} is the subgradient of h at x_{k+1} coming from the optimality condition of the prox at x_k. Next, we analyze the global convergence of I-PGM in the norm of the gradient mapping. We have the following theorem.

Theorem 1 Let F be a nonconvex function admitting a (δ_k, L_k)-oracle of degree q ∈ [0, 2) at each iteration k, with δ_k ≥ 0 and L_k > 0 for all k ≥ 0. Let (x_k)_{k≥0} be generated by I-PGM and assume that α_k ≤ 1/(L_k + qρ) for some arbitrary parameter ρ > 0. Then, there exists p_{k+1} ∈ ∂h(x_{k+1}) such that:

min_{j=0:k} (α_j/2)‖g_j + p_{j+1}‖² ≤ (f(x_0) − f_∞)/(k + 1) + ((2 − q)/(2ρ^{q/(2−q)})) · (Σ_{j=0}^k δ_j^{2/(2−q)})/(k + 1).   (10)

Proof Denote g_{δ_k,L_k,q}(x_k) = g_k. From the optimality conditions of the proximal operator defining x_{k+1}, we have:

g_k + p_{k+1} = (1/α_k)(x_k − x_{k+1}) for some p_{k+1} ∈ ∂h(x_{k+1}).

Further, from inequality (9), we get:

f(x_{k+1}) ≤ f(x_k) + ⟨g_k + p_{k+1}, x_{k+1} − x_k⟩ + ((L_k + qρ)/2)‖x_{k+1} − x_k‖² + ((2 − q)/2) δ_k^{2/(2−q)}/ρ^{q/(2−q)}
≤ f(x_k) − (1/α_k − (L_k + qρ)/2)‖x_{k+1} − x_k‖² + ((2 − q)/2) δ_k^{2/(2−q)}/ρ^{q/(2−q)}
≤ f(x_k) − (α_k/2)‖g_k + p_{k+1}‖² + ((2 − q)/2) δ_k^{2/(2−q)}/ρ^{q/(2−q)},

where the first inequality follows from the convexity of h and p_{k+1} ∈ ∂h(x_{k+1}), and the last inequality follows from the definition of α_k. Hence, we get:

(α_k/2)‖g_k + p_{k+1}‖² ≤ f(x_k) − f(x_{k+1}) + ((2 − q)/2) δ_k^{2/(2−q)}/ρ^{q/(2−q)}.

Summing up this inequality from j = 0 to j = k and using the fact that f(x_{k+1}) ≥ f_∞, where recall that f_∞ denotes a finite lower bound for the objective function, our statement follows.
For a particular choice of the algorithm parameters, we can get simpler convergence estimates.
Theorem 2 Let the assumptions of Theorem 1 hold and consider, for all k ≥ 0, the parameter choices with geometric decay factors 0 ≤ ζ, β < 1. Then, we have the estimate (11).

Proof Taking the minimum in inequality (10) and using the closed forms of α_j and δ_j, together with the fact that 0 ≤ ζ < 1 implies Σ_{j=0}^k ζ^j ≤ 1/(1 − ζ) for all k ≥ 0, our statement follows.
Let us analyze the bound of Theorem 2 in more detail. For simplicity, consider the case q = 1 (see Example 1). Since the parameter ρ > 0 is a degree of freedom, minimizing the right-hand side of the resulting bound with respect to ρ yields an optimal choice of ρ; replacing this expression for ρ in the bound gives a simplified estimate.

Note that, if β > ζ, the gradient mapping min_{j=0:k} ‖g_j + p_{j+1}‖² converges regardless of the accuracy δ of the oracle. This is not the case for q = 0, where the convergence rate is of order O(1/k + δ), see also [9]. The following corollary provides a convergence rate for general q, for the particular choice ζ = β = 0 of the parameters.
Proof Replacing ζ = β = 0 in inequality (11), we get a bound depending on ρ. If 0 ≤ q < 2, taking ρ = L in this bound yields the first statement. Further, if 1 ≤ q < 2, minimizing the second and third terms of the right-hand side over ρ yields an optimal choice of ρ; replacing this expression for ρ in the bound gives the second statement.
Remark 4 Let us analyse this convergence rate in more detail for Example 1. For q = 0, we have δ = 2D∆ and L = L_F, where D is the diameter of dom f. On the other hand, for q = 1, we have δ = ∆ and L = L_F. Hence, if we want to achieve min_{j=0:k} ‖g_j + p_{j+1}‖² ≤ ε, for q = 0 we impose 4DL_F∆ ≤ ε/2, which means that one needs to compute an approximate gradient with accuracy ∆ = O(ε), while for q = 1 we impose 2∆² ≤ ε/2, meaning that one only needs an approximate gradient with accuracy ∆ = O(ε^{1/2}). Hence, for this example it is more natural to use our inexact first-order oracle definition with q = 1 than with q = 0, since it requires less accuracy in approximating the true gradient.
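The accuracy requirements of this remark can be compared numerically (with the illustrative normalization D = L_F = 1, an assumption of ours):

```python
# Required gradient accuracy Delta for a target eps on the gradient mapping,
# following Remark 4 with D = L_F = 1.
eps = 1e-4
delta_q0 = eps / 8            # from 4*D*L_F*Delta <= eps/2  (case q = 0)
delta_q1 = (eps / 4) ** 0.5   # from 2*Delta^2 <= eps/2      (case q = 1)
assert delta_q1 > delta_q0    # q = 1 tolerates a much coarser gradient
```

For ε = 10⁻⁴ this gives ∆ ≈ 1.25·10⁻⁵ for q = 0 versus ∆ = 5·10⁻³ for q = 1, a difference of more than two orders of magnitude.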
Note that in the second result of Corollary 1, the parameter ρ depends on the difference ∆_0 = f(x_0) − f_∞ and, usually, f_∞ is unknown. In practice, we can approximate ∆_0 by using an estimate of f_∞ in place of its exact value, see e.g. [10]. Under this setting, the sequence ε_k and the iterates of I-PGM corresponding to the second result of Corollary 1 are updated adaptively. This process is well defined, i.e., the "while" step finishes in a finite number of iterations. Hence, we can still derive a convergence rate for the second result of Corollary 1 using this adaptive process: replacing in (10) the difference ∆_0 = f(x_0) − f_∞ with its estimate ∆_0^k, the second statement of Corollary 1 remains valid with ∆_0^k instead of ∆_0.
Remark 5 We observe that for q = 0 we recover the same convergence rate as in [9]. However, our result does not require boundedness of the domain of f, while in [9] the rate depends explicitly on the diameter of the domain of f. Moreover, for q > 0 our convergence bounds are better than those in [9], i.e., the coefficients of the terms in δ are either smaller or even tend to zero, while in [9] they are always constant.
Further, let us consider the case of Example 4, where F satisfies the Hölder condition with constant ν ∈ (0, 1], and take β = ζ = 0. We have shown that for any δ > 0 this class of functions can be equipped with a (δ, L)-oracle of degree q (see Example 4 for the expression of the constant C(H_ν, q)). In view of the first result of Corollary 1, after k iterations we obtain a bound with constants C_1 := 2(q + 1)∆_0 C(H_ν, q) and C_2 := (q + 1)(2 − q) C(H_ν, q)^{(2−2q)/(2−q)}. Since in this example we can choose δ, its optimal value is computed by balancing the two terms of the bound; replacing this optimal choice of δ, we obtain a rate of order O(k^{−2ν/(1+ν)}).
Remark 6 Note that our convergence rate of order O(k^{−2ν/(1+ν)}) for Algorithm 1 (I-PGM) for nonconvex problems whose first term F has a Hölder continuous gradient (Example 4) recovers the rate obtained in [9] under the same setting.
Finally, let us show that when the gradient mapping is small enough, i.e., ‖g_k + p_{k+1}‖ is small, x_{k+1} is a good approximation of a stationary point of problem (1). Note that any choice α_k ≤ 1/(L + qρ) yields:

‖x_{k+1} − x_k‖ = α_k‖g_k + p_{k+1}‖ ≤ (1/(L + qρ))‖g_k + p_{k+1}‖.

Hence, if the gradient mapping is small, then the norm of the difference x_{k+1} − x_k is also small.

Theorem 3 Let (x_k)_{k≥0} be generated by I-PGM and let p_{k+1} ∈ ∂h(x_{k+1}). In the setting of Example 1 (and similarly in the setting of Example 4), the distance from 0 to ∂f(x_{k+1}) is bounded in terms of the gradient mapping, the oracle accuracy and ‖x_{k+1} − x_k‖.

Proof Let us consider Example 1, where F is L_F-smooth and h is convex. Since ∇F(x_{k+1}) + p_{k+1} ∈ ∂f(x_{k+1}), we have:

‖∇F(x_{k+1}) + p_{k+1}‖ ≤ ‖g_k + p_{k+1}‖ + ‖∇F(x_{k+1}) − ∇F(x_k)‖ + ‖∇F(x_k) − g_k‖ ≤ ‖g_k + p_{k+1}‖ + L_F‖x_{k+1} − x_k‖ + ∆.

Further, assume we are in the case of Example 4. Then g(x_k) ∈ ∂F(x_k) and, for any g(x_{k+1}) ∈ ∂F(x_{k+1}), the Hölder condition (8) gives:

‖g(x_{k+1}) + p_{k+1}‖ ≤ ‖g(x_k) + p_{k+1}‖ + H_ν‖x_{k+1} − x_k‖^ν.

This proves our statements.
Thus, when (1/α_k)(x_{k+1} − x_k) = g_k + p_{k+1} is small, x_{k+1} is an approximate stationary point of problem (1). Note that our convergence rates from this section improve as q increases, i.e., the terms depending on δ are smaller for q > 0 than for q = 0. In particular, the power of δ in the convergence estimate is higher for q ∈ (0, 1) than for q = 0, while for q ≥ 1 the coefficients of δ even diminish with k. Hence, it is beneficial to have an inexact first-order oracle of degree q > 0, as this allows us to work with a less accurate approximation of the (sub)gradient of the nonconvex function F than for q = 0.

Convex convergence analysis
In this section, we analyze the convergence rate of I-PGM for problem (1), where F is now assumed to be a convex function. By adding extra information to the oracle (5), we consider the following modification of Definition 2.

Definition 3 A convex function F is equipped with an inexact first-order (δ, L)-oracle of degree 0 ≤ q < 2 if for any y ∈ dom f we can compute a vector g_{δ,L,q}(y) such that:

0 ≤ F(x) − F(y) − ⟨g_{δ,L,q}(y), x − y⟩ ≤ (L/2)‖x − y‖² + δ‖x − y‖^q for all x ∈ dom f.   (12)

Note that Example 4 satisfies this definition. In (12), the zero-order information is considered exact. This is not the case in [7], which considers the particular choice q = 0. Further, the first-order information g_{δ,L,q} in (12) is a subgradient of F at y, while in [7] it is a δ-subgradient. However, using this inexact first-order oracle of degree q, I-PGM attains better rates than those in [7]. From (12) and (4), we get, for all ρ > 0:

0 ≤ F(x) − F(y) − ⟨g_{δ,L,q}(y), x − y⟩ ≤ ((L + qρ)/2)‖x − y‖² + ((2 − q)/2) δ^{2/(2−q)}/ρ^{q/(2−q)}.   (13)

Next, we analyze the convergence rate of I-PGM in the convex setting. We have the following convergence rate.

Corollary 2 Let F be a convex function admitting a (δ, L)-oracle of degree q ∈ [0, 2) (see Definition 3). Let (x_k)_{k≥0} be generated by I-PGM with α_k = 1/(L + qρ) for some ρ > 0. Define x̄_k = (1/(k + 1)) Σ_{i=0}^k x_{i+1} and R = ‖x_0 − x*‖. Then, we have:

f(x̄_k) − f(x*) ≤ ((L + qρ)R²)/(2(k + 1)) + ((2 − q)/2) δ^{2/(2−q)}/ρ^{q/(2−q)}.   (14)

Proof Follows from (13) and Theorem 2 in [7].
Since we have the freedom of choosing ρ, let us minimize the right-hand side of (14) over ρ. Balancing the two ρ-dependent terms yields the optimal choice ρ = δ((k + 1)/R²)^{(2−q)/2}. Finally, fixing the number of iterations k and replacing this expression in (14), we get:

f(x̄_k) − f(x*) ≤ (LR²)/(2(k + 1)) + (δR^q)/(k + 1)^{q/2}.

One can notice that our rate in function values is of order O(k^{−1} + δk^{−q/2}), while in [7] the rate is of order O(k^{−1} + δ). Hence, when q > 0, regardless of the accuracy of the oracle, our second term diminishes with k, while in [7] it remains constant. Hence, our new definition of inexact oracle of degree q, Definition 3, is also beneficial in the convex case when analysing proximal gradient type methods, i.e., large q yields better rates.
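The contrast between the two δ-terms can be tabulated directly (illustrative values of δ, R and k, chosen by us):

```python
import numpy as np

# delta-term delta * R^q / k^(q/2) of the convex rate: constant for q = 0,
# vanishing with k for q > 0.
delta, R = 0.1, 1.0
ks = np.array([10, 100, 1000, 10000], dtype=float)
for q in (0.0, 0.5, 1.0):
    term = delta * R**q / ks ** (q / 2)
    if q == 0.0:
        assert np.all(term == delta)   # no decay: stalls at the oracle error
    else:
        assert term[-1] < term[0]      # decays as k grows
```

For q = 1 and δ = 0.1, the oracle-error contribution drops from 0.1/√10 ≈ 0.032 at k = 10 to 0.001 at k = 10⁴.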
We also consider an extension of the fast inexact projected gradient method from [7], in which the projection is replaced by a proximal step with respect to the function h (see [15]); we call the resulting scheme FI-PGM. The inexactness in FI-PGM again comes from the approximate computation of the (sub)gradient of F, denoted g_{δ_k,L_k,q}(x_k), as given in Definition 3. Let (θ_k)_{k≥0} be a sequence satisfying the standard fast-gradient condition:

(1 − θ_{k+1})/θ_{k+1}² ≤ 1/θ_k².   (15)

Then, the fast inexact proximal gradient method (FI-PGM) is as follows:

Algorithm 3 Fast inexact proximal gradient method (FI-PGM)
1. Given x_0 ∈ dom h, θ_0 ∈ (0, 1] and 0 ≤ q < 2. For k ≥ 0 do:
2. Choose δ_k, L_k and α_k; obtain g_{δ_k,L_k,q}(x_k).
3. Compute y_k via a proximal gradient step from x_k.
4. Compute z_k as the minimizer of the accumulated linear model Σ_{i=0}^k (θ_i/L_i)(⟨g_{δ_i,L_i,q}(x_i), x − x_i⟩ + h(x)) augmented with a proximity term.
5. Choose θ_{k+1} satisfying condition (15) and compute x_{k+1} as a convex combination of y_k and z_k.

Using a proof similar to [7], we get the following convergence rate for the FI-PGM algorithm.

Corollary 3 Let F satisfy the assumptions of Lemma 2 and let (y_k)_{k≥0} be generated by FI-PGM. Then, for all ρ > 0, we have the convergence estimate (16).

Proof The proof follows from (13) and Theorem 4 in [7].
The optimal ρ in the right-hand side of inequality (16) is obtained by balancing the ρ-dependent terms. Replacing ρ with its optimal value in (16), we obtain a rate of order O(k^{−2} + δk^{1−3q/2}). Hence, if q > 2/3, then FI-PGM does not accumulate errors under our inexact oracle, while the FI-PGM scheme in [7] always displays error accumulation, as its convergence rate is of order O(k^{−2} + δk). Therefore, the same conclusion holds as for I-PGM: for the FI-PGM scheme in the convex setting, it is beneficial to have an inexact first-order oracle with large degree q.
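The accumulation threshold q = 2/3 is visible directly from the exponent 1 − 3q/2 (illustrative δ and k values, chosen by us):

```python
import numpy as np

# delta-term delta * k^(1 - 3q/2) of the FI-PGM rate: grows with k for
# q < 2/3 (error accumulation) and shrinks for q > 2/3.
delta = 0.01
ks = np.array([10.0, 100.0, 1000.0])
grow = delta * ks ** (1 - 1.5 * 0.0)   # q = 0: term ~ delta*k, accumulates
decay = delta * ks ** (1 - 1.5 * 1.0)  # q = 1: term ~ delta/sqrt(k), vanishes
assert grow[-1] > grow[0] and decay[-1] < decay[0]
```

At q = 2/3 exactly, the exponent is zero and the term stays at δ, matching the nonaccelerated behavior for q = 0 in [7].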
Remark 7 In our Definition 2 we have considered exact zero-order information. However, it is possible to change this definition to allow inexact zero-order information in the nonconvex case, i.e., replacing F(y) in (5) with an approximation F_{δ_0}(y). With this modification, the convergence result in Theorem 1 acquires an additional term, so the rate is also influenced by the inexactness of the zero-order information (i.e., by δ_0). Note that for the convex case this extension is not possible in Definition 3 when q > 0, since we must have:

0 ≤ F(x) − F_{δ_0}(y) − ⟨g_{δ,L,q}(y), x − y⟩ ≤ (L/2)‖x − y‖² + δ‖x − y‖^q,

which implies for x = y that F(x) = F_{δ_0}(x). Since we want consistency between Definitions 2 and 3, we have chosen to work with exact zero-order information in the nonconvex convergence analysis as well.

Numerical simulations
In this section, we evaluate the performance of I-PGM on a composite problem arising in image restoration. Namely, we consider a nonconvex optimization problem from [12] with parameters R > 0, b ∈ R^N and a_i ∈ Rⁿ for i = 1 : N. In image restoration, b represents the noisy blurred image and A = (a_1, ···, a_N) ∈ R^{n×N} is a blur operator [12]. This problem fits into our general problem (1), with F the nonconvex data-fitting term and h(x) the indicator function of the bounded convex set {x : ‖x‖₁ ≤ R}. We generate the inexact oracle by adding normally distributed random noise δ to the true gradient, i.e., g_{δ,L,q}(x) := ∇F(x) + δ. This is a particular case of Example 1. Moreover, for all x and y satisfying ‖x‖₁ ≤ R and ‖y‖₁ ≤ R we have ‖x − y‖ ≤ 2R, hence:

‖δ‖‖x − y‖ ≤ ‖δ‖(2R)^{1−q}‖x − y‖^q.

Thus, this example satisfies Definition 2 for all q ∈ [0, 1]. We apply I-PGM to this example for three choices of the degree q: 0, 1/2 and 1. Recall the convergence rate of I-PGM with constant stepsize given in the first statement of Corollary 1, denoted (18). At each iteration of I-PGM we need to solve a convex subproblem, namely the projection onto the ℓ₁-ball, which has a closed-form solution (see e.g. [19]). We run I-PGM with constant stepsize α_k = 1/(2(L_F + qρ)) and ρ = L_F for the three choices q = 0, 1/2, 1 and three noise levels ‖δ‖ ≤ 0.1, 1, 3, respectively. The results are given in Fig. 1 (dotted lines), where we plot the evolution of the error min_{j=0:k} ‖(1/α_j)(x_{j+1} − x_j)‖², which corresponds to the gradient mapping. In the same figure we also plot the theoretical bounds (18) for q = 0, 1/2, 1 (full lines). Our main figures are Figure 1(a), (c) and (d), while Figure 1(b) is a zoom of Figure 1(a) displaying only the first 300 iterations. One can see in the main figures that the behaviour of our algorithm for q = 1 is better than for q = 1/2, and similarly the behaviour for q = 1/2 is better than for q = 0. These improvements become visible after 300 iterations when the error δ is small (see Figure 1(c) and (d)); when the error δ is large, a larger number of iterations is needed before they become visible (see Figure 1(a) and (b)). This is natural, since large errors in the gradient approximation must impact the convergence speed. Hence, as the degree q increases or the norm of the noise decreases, better accuracies for the norm of the gradient mapping can be achieved, which supports our theoretical findings. Moreover, one can observe that the gap between the theoretical and the practical bounds is large in Figure 1(c) and (d). We believe this happens because the theoretical bounds are derived under worst-case scenarios: the analysis must account for the worst-case direction generated by the inexact first-order oracle, while practical implementations, which often involve randomness, usually do not encounter these worst-case directions. However, the simulations in Figure 1(a) show that the gap between the theoretical bounds and the practical behavior need not be large. More precisely, in that experiment we generated at each iteration 100 random directions and, in order to update the new point, we chose the worst direction with respect to the gradient mapping (i.e., the one yielding the largest ‖x_{k+1} − x_k‖). One can see in Figure 1(a) that the theoretical and practical bounds get closer for a sufficiently large number of iterations.
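A stripped-down version of this experiment can be sketched as follows. This is our own toy instance: a least-squares data term stands in for the nonconvex loss of [12], the dimensions and noise level are arbitrary, and the ℓ₁-ball projection follows the well-known sort-based closed form referenced as [19].

```python
import numpy as np

def project_l1_ball(v, R):
    # Euclidean projection onto {x : ||x||_1 <= R} via the sort-based
    # closed form; returns v itself if it is already feasible.
    if np.abs(v).sum() <= R:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - R)[0][-1]
    theta = (css[rho] - R) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

rng = np.random.default_rng(0)
n, N, R, Delta = 30, 60, 4.0, 0.1
A = rng.standard_normal((N, n))
b = rng.standard_normal(N)
L_F = np.linalg.svd(A, compute_uv=False)[0] ** 2   # Lipschitz constant

x = np.zeros(n)
alpha = 1.0 / (2 * L_F)                            # constant stepsize
for k in range(300):
    noise = rng.standard_normal(n)
    noise *= Delta / np.linalg.norm(noise)         # oracle error of norm Delta
    g = A.T @ (A @ x - b) + noise                  # inexact gradient
    x = project_l1_ball(x - alpha * g, R)          # prox of the indicator of the l1-ball
assert np.abs(x).sum() <= R + 1e-8                 # iterates stay feasible
```

Replacing the data term with the nonconvex loss of [12] and logging min_{j≤k} ‖(1/α)(x_{j+1} − x_j)‖² reproduces the quantity plotted in Fig. 1.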

Fig. 1: Practical (dotted lines) and theoretical (full lines) performance of the I-PGM algorithm for different choices of q and δ, with R = 4. Figure (b) is a zoom of the left corner of Figure (a).