1 Introduction

Optimization methods based on gradient information are widely used in applications where high accuracy is not required, such as machine learning, data analysis, signal processing and statistics [2, 4, 16, 20]. The standard convergence analysis of gradient-based methods requires exact gradient information for the objective function. However, in many optimization problems one doesn’t have access to exact gradients, e.g., when the gradient is itself obtained by solving another optimization problem. In this case one can use inexact (approximate) gradient information. In this paper, we consider the following composite optimization problem:

$$\begin{aligned} \min \limits _{x\in \mathbb {E}} f(x): = F(x) + h(x), \end{aligned}$$
(1)

where \(h: \mathbb {E} \rightarrow \bar{\mathbb {R}}\) is a simple (i.e., proximal easy) closed convex function, \(F: \mathbb {E} \rightarrow \mathbb {R}\) is a general lower semicontinuous function (possibly nonconvex) and there exists \(f_\infty \) such that \(f(x)\ge f_\infty {>-\infty }\) for all \(x \in \text {dom} \,f = \text {dom}\, h\). We assume that we can compute exactly the proximal operator of h, and that we do not have access to the (sub)differential of F, but we can compute an approximation of it at any given point. Optimization algorithms with inexact first-order oracles are well studied in the literature, see e.g., [3, 5,6,7,8,9, 18]. For example, [7] considers the case where h is the indicator function of a convex set Q and F is a convex function, and introduces the so-called inexact first-order \((\delta ,L)\)-oracle for F, i.e., for any \(y\in Q\) one can compute an inexact oracle consisting of a pair \((F_{\delta ,L}(y), g_{\delta , L}(y))\) such that:

$$\begin{aligned} 0\le F(x) - \Big (F_{\delta ,L }(y) + \langle g_{\delta ,L}(y),x-y \rangle \Big )\le \frac{L}{2}\Vert x-y\Vert ^{2} + \delta \quad \forall x \in Q. \end{aligned}$$
(2)

Then, [7] introduces (fast) inexact first-order methods based on \(g_{\delta , L}(y)\) information and derives asymptotic convergence in function values of order \(\mathcal {O}\left( \frac{1}{k} + \delta \right) \) or \(\mathcal {O}\left( \frac{1}{k^2} + k\delta \right) \), respectively. One can notice that in the nonaccelerated scheme the objective function accuracy decreases with k and asymptotically tends to \(\delta \), while in the accelerated scheme there is error accumulation. Further, [9] considers problem (1) with the domain of h bounded, and introduces the following inexact first-order oracle:

$$\begin{aligned} \vert F(x) - F_{\delta ,L}(x)\vert \le \delta ,\;\; F(x) - F_{\delta ,L}(y) - \langle g_{\delta ,L}(y),x-y \rangle \le \frac{L}{2}\Vert x - y \Vert ^2 + \delta . \end{aligned}$$

Under the assumptions that F is nonconvex and h is convex but with bounded domain, [9] derives a sublinear rate in the squared norm of the generalized gradient mapping of order \(\mathcal {O}\left( \frac{1}{k} + \delta \right) \) for an inexact proximal gradient method based on \(g_{\delta , L}(y)\) information. Note that all previous results provide convergence rates under the assumption of the boundedness of the domain of f (or equivalently of h). An open question is whether one can modify the previous definitions of inexact first-order oracle to cover both the convex and nonconvex settings, in order to be more general and to improve the convergence results of an algorithm based on this inexact information. More precisely, can one define a general inexact oracle that bridges the gap between the exact oracle (exact gradient information) and the existing inexact first-order oracle definitions found in the literature [7, 9]? In this paper we answer this question positively for both convex and nonconvex problems, introducing a suitable definition of inexactness for a first-order oracle for F involving some degree \(0\le q < 2\), which consists in multiplying the constant \(\delta \) in (2) by the quantity \(\Vert x - y \Vert ^q\) (see Definition 2). We provide several examples that fit in our proposed inexact first-order oracle framework, such as an approximate gradient or a weak level of smoothness, and show that, under this new definition of inexactness, we can remove the boundedness assumption on the domain of h. Then, we consider an inexact proximal gradient algorithm based on this inexact first-order oracle and provide convergence rates of order \(\mathcal {O}\left( \frac{1}{k} + \delta ^{2/(2-q)} \right) \) for \(q \in [0, 1)\) and \(\mathcal {O}\left( \frac{1}{k} + \frac{\delta }{k^{q/2}}+ \frac{\delta ^2}{k^{q-1}} \right) \) for \(q \in [1, 2)\) for nonconvex composite problems, and of order \(\mathcal {O}\left( \frac{1}{k} + \frac{\delta }{k^{q/2}} \right) \) for \(q \in [0, 2)\) for convex composite problems of the form (1). We also derive convergence rates of order \(\mathcal {O}(\frac{1}{k^2} + \frac{\delta }{k^{(3q-2)/2}})\) for a fast inexact proximal gradient algorithm for solving the convex composite problem (1). Note that our convergence rates improve as q increases. In particular, for the inexact proximal gradient algorithm the power of \(\delta \) in the convergence estimate is higher for \(q \in (0,1)\) than for \(q=0\), while for \(q \ge 1\) the coefficients of \(\delta \) diminish with k. For the fast inexact proximal gradient method we show that there is no error accumulation for \(q \ge 2/3\). Hence, it is beneficial to consider an inexact first-order oracle of degree \(q > 0\), as this allows us to work with less accurate approximations of the (sub)gradient of F when q is large.

2 Notations and preliminaries

In what follows \(\mathbb {R}^n\) denotes the finite-dimensional Euclidean space endowed with the standard inner product \(\langle s, x\rangle = s^Tx\) and the corresponding norm \(\Vert s \Vert = \langle s,s\rangle ^{1/2}\) for any \(s\in \mathbb {R}^n\). For a proper lower semicontinuous convex function h we denote its domain by \(\text {dom}\;h = \lbrace x\in \mathbb {R}^n:\; h(x)< \infty \rbrace \) and its proximal operator as:

$$\begin{aligned} \text {prox}_{\gamma h}(x) := \text {arg min}_{y\in \text {dom}\;h} h(y) + \frac{1}{2\gamma }\Vert x - y\Vert ^2. \end{aligned}$$
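For instance, when \(h(x) = \lambda \Vert x\Vert _1\) the proximal operator reduces to the well-known soft-thresholding map. A minimal Python sketch (the particular h and the random test point are only illustrative):

```python
import numpy as np

def prox_l1(x, gamma, lam):
    """Proximal operator of h(x) = lam*||x||_1 with step gamma (soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - gamma * lam, 0.0)

# sanity check: the prox minimizes h(y) + ||x - y||^2 / (2*gamma)
rng = np.random.default_rng(0)
x, gamma, lam = rng.standard_normal(5), 0.7, 0.3
p = prox_l1(x, gamma, lam)
obj = lambda y: lam * np.abs(y).sum() + np.linalg.norm(x - y) ** 2 / (2 * gamma)
assert all(obj(p) <= obj(p + 1e-3 * rng.standard_normal(5)) for _ in range(100))
```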

Next, we provide a few definitions and properties for subdifferential calculus in the nonconvex settings (see [13, 17] for more details).

Definition 1

(Subdifferential): Let \(f: \mathbb {R}^n \rightarrow \bar{\mathbb {R}}\) be a proper lower semicontinuous function. For a given \(x \in \text {dom} \; f\), the Fréchet subdifferential of f at x, written \(\widehat{\partial }f(x)\), is the set of all vectors \(g_{x}\in \mathbb {R}^n\) satisfying:

$$\begin{aligned} \liminf _{y\rightarrow x,\, y\ne x}\frac{f(y) - f(x) - \langle g_{x}, y - x\rangle }{\Vert x-y\Vert }\ge 0. \end{aligned}$$

When \(x \notin \text {dom} \; f\), we set \(\widehat{\partial } f(x) = \emptyset \). The limiting-subdifferential, or simply the subdifferential, of f at \(x\in \text {dom} \, f\), written \(\partial f(x)\), is defined as [13]:

$$\begin{aligned} \partial f(x):= \left\{ g_{x}\in \mathbb {R}^n\!\!: \exists x^{k}\rightarrow x, f(x^{k})\rightarrow f(x) \; \text {and} \; \exists g_{x}^{k}\in \widehat{\partial } f(x^{k}) \;\; \text {such that} \;\; g_{x}^{k} \rightarrow g_{x}\right\} . \end{aligned}$$

Note that we have \(\widehat{\partial }f(x)\subseteq \partial f(x)\) for each \(x\in \text {dom}\,f\). For \(f(x) = F(x) + h(x)\), if F and h are regular at \(x\in \text {dom}f\), then we have:

$$\begin{aligned} \partial f(x) = \partial F(x) + \partial h(x). \end{aligned}$$

(see Theorem 6 in [11] for more details). Further, if f is proper, lower semicontinuous and convex, then [17]:

$$\begin{aligned} \partial f(x) = \widehat{\partial } f(x) = \{\lambda \in \mathbb {R}^n: f(y)\ge f(x) + \langle \lambda ,y-x \rangle \;\; \forall y\in \mathbb {R}^n\}. \end{aligned}$$

A function \(F: \mathbb {R}^n\rightarrow \mathbb {R}\) is \(L_F\)-smooth if it is differentiable and its gradient is \(L_F\) Lipschitz, i.e., satisfying:

$$\begin{aligned} \Vert \nabla F(x) - \nabla F(y) \Vert \le L_F \Vert x - y \Vert ,\quad \forall x,y\in \mathbb {R}^n. \end{aligned}$$

It follows immediately that [14]:

$$\begin{aligned} \vert F(x) - (F(y) + \langle \nabla F(y),x-y \rangle ) \vert \le \frac{L_F}{2} \Vert x - y \Vert ^{2}, \quad \forall x,y\in \mathbb {R}^n. \end{aligned}$$
(3)

Finally, let us recall the following classical weighted arithmetic–geometric mean inequality: if \(a, b\) are positive constants and \(0\le \alpha _1,\alpha _2 \le 1\), such that \(\alpha _1 + \alpha _2 = 1\), then \(a^{\alpha _1}b^{\alpha _2}\le \alpha _1 a + \alpha _2 b\). We will later use the following consequence for \(\rho >0\), \(a=\rho \Vert x-y\Vert ^2\), \(b=\frac{\delta _{q}^{\frac{2}{2-q}}}{\rho ^{\frac{q}{2-q}}}\), \(\alpha _1 = \frac{q}{2}\) and \(\alpha _2 = \frac{2 - q}{2}\):

$$\begin{aligned} {\delta _q} \Vert x-y\Vert ^q \le \frac{q\rho \Vert x-y\Vert ^2}{2} + \frac{(2-q)\delta _{q}^{\frac{2}{2-q}}}{2\rho ^{\frac{q}{2-q}}}. \end{aligned}$$
(4)
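For completeness, the substitution behind (4) can be checked directly: with the above choices of a, b, \(\alpha _1\) and \(\alpha _2\),

$$\begin{aligned} a^{\alpha _1}b^{\alpha _2} = \left( \rho \Vert x-y\Vert ^2\right) ^{\frac{q}{2}}\left( \frac{\delta _{q}^{\frac{2}{2-q}}}{\rho ^{\frac{q}{2-q}}}\right) ^{\frac{2-q}{2}} = \rho ^{\frac{q}{2}}\Vert x-y\Vert ^{q}\,\delta _{q}\,\rho ^{-\frac{q}{2}} = \delta _q \Vert x-y\Vert ^q, \end{aligned}$$

while \(\alpha _1 a + \alpha _2 b\) is exactly the right-hand side of (4).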

3 Inexact first-order oracle of degree q

In this section, we introduce our new inexact first-order oracle of degree \(0\le q <2\) and provide some nontrivial examples that fit into our framework. Our oracle can deal with general functions (possibly with unbounded domain), unlike the previous results in [7, 9], but requires exact zero-order information.

Definition 2

The function F is equipped with an inexact first-order \((\delta , L)\)-oracle of degree \(q \!\in \! [0,2)\) if for any \(y \!\in \! \text {dom} f\) one can compute \( g_{\delta ,L,q}(y) \in \mathbb {E}^{*}\) such that:

$$\begin{aligned} F(x) \!-\! \left( F(y) + \langle g_{\delta ,L,q}(y),x-y \rangle \right) \!\le \! \frac{L}{2}\Vert x-y \Vert ^{2} + \delta \Vert x - y \Vert ^q \;\;\; \forall x \!\in \! \text {dom} f. \end{aligned}$$
(5)

To the best of our knowledge this definition of a first-order inexact oracle is new. The motivation behind this definition is to introduce a versatile inexact first-order oracle framework that bridges the gap between exact oracle (exact gradient information, i.e., \(q=2\)) and the existing inexact first-order oracle definitions found in the literature (i.e., \(q=0\)). More specifically, when \(q=2\), Definition 2 aligns with established results for smooth functions under exact gradient information, while when \(q=0\), our definition has been previously explored in the literature, see [7, 9]. Next, we provide several examples that satisfy Definition 2 naturally, and then we provide theoretical results showing the advantages of this new inexact oracle over the existing ones from the literature.
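To make Definition 2 concrete, the following Python sketch verifies inequality (5) numerically for a smooth quadratic F and a perturbed gradient, with \(q=1\) and \(\delta \) equal to the norm of the perturbation; the particular F and the noise model are illustrative assumptions only (see also Example 1 below):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
B = rng.standard_normal((n, n))
Q = B.T @ B                                   # F(x) = 0.5 x^T Q x, with L = ||Q||_2
L = np.linalg.norm(Q, 2)
F = lambda x: 0.5 * x @ Q @ x
grad_F = lambda x: Q @ x

noise = 1e-2 * rng.standard_normal(n)         # fixed gradient perturbation
delta, q = np.linalg.norm(noise), 1           # degree-1 oracle with delta = ||noise||

for _ in range(1000):
    x, y = rng.standard_normal(n), rng.standard_normal(n)
    g = grad_F(y) + noise                     # inexact first-order information at y
    lhs = F(x) - (F(y) + g @ (x - y))
    rhs = 0.5 * L * np.linalg.norm(x - y) ** 2 + delta * np.linalg.norm(x - y) ** q
    assert lhs <= rhs + 1e-12                 # inequality (5) holds
```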

Example 1

(Smooth function with inexact first-order oracle). Let F be differentiable and its gradient be Lipschitz continuous with constant \(L_F\) over \(\text {dom}f\). Assume that for any \(x\in \text {dom}\,f\), one can compute \(g_{\Delta ,L_F}(x)\), an approximation of the gradient \(\nabla F(x)\) satisfying:

$$\begin{aligned} \Vert \nabla F(x) - g_{\Delta ,L_F}(x) \Vert \le \Delta . \end{aligned}$$
(6)

Then, F is equipped with \((\delta ,L)\)-oracle of degree \(q=1\) as in Definition 2, with \(\delta = \Delta \), \(L=L_F\), and \(g_{\delta ,L,1}(x)= g_{\Delta ,L_F}(x)\). Indeed, since F is \(L_F\)-smooth, we get:

$$\begin{aligned} F(y) - F(x) - \langle \nabla F(x),y-x \rangle&\le \frac{L_F}{2}\Vert y - x \Vert ^2. \end{aligned}$$

It follows that:

$$\begin{aligned} F(y)\! -\! F(x)\! - \!\langle g_{\Delta ,L_F}(x),y \!-\! x \rangle \!&\le \frac{L_F}{2}\Vert y \!-\! x \Vert ^2 \!+\! \Vert \nabla F(x) \!-\! g_{\Delta ,L_F}(x) \Vert \,\Vert y \!-\! x \Vert \\&\le \frac{L_F}{2}\Vert y - x \Vert ^2 + \Delta \Vert y - x \Vert . \end{aligned}$$

This completes our statement. Finite-sum optimization problems appear widely in machine learning [4] and deal with an objective \(F(x): = \sum _{i=1}^{N} F_i(x)\), where N is possibly large. In the stochastic setting, we sample stochastic derivatives at each iteration in order to form a mini-batch approximation of the gradient of F. If we define:

$$\begin{aligned} g_{S}(x) = \frac{1}{\vert S\vert }\sum \limits _{j\in S} \nabla F_j(x), \end{aligned}$$
(7)

where S is a subset of \(\{1,\ldots ,N\}\), then condition (6) holds with probability at least \(1 - \Delta \) if the batch size satisfies \(\vert S \vert = \mathcal {O}\left( \frac{\Delta ^2}{L_{F}^2} + \frac{1}{N} \right) ^{-1}\) (see Lemma 11 in [1]).
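For illustration, the following Python sketch builds the mini-batch approximation (7) and reports the resulting gradient error \(\Delta \) for a simple finite-sum least-squares objective; the data, the loss and the batch sizes are illustrative assumptions (and F is taken here as the average of the \(F_i\)'s, for which \(g_S\) is an unbiased estimator), not the setting of Lemma 11 in [1]:

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 1000, 30
A, b = rng.standard_normal((N, n)), rng.standard_normal(N)

# Illustrative finite sum: F(x) = (1/N) * sum_i 0.5*(a_i^T x - b_i)^2.
def grad_F(x):
    return A.T @ (A @ x - b) / N

def minibatch_grad(x, batch_size):
    S = rng.choice(N, size=batch_size, replace=False)   # sampled index set S
    return A[S].T @ (A[S] @ x - b[S]) / batch_size      # g_S(x) as in (7)

x = rng.standard_normal(n)
for bs in (10, 100, 1000):
    Delta = np.linalg.norm(grad_F(x) - minibatch_grad(x, bs))
    print(f"batch size {bs:4d}: gradient error Delta = {Delta:.3e}")   # shrinks as |S| grows
```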

Remark 1

This example has also been considered in [7, 9]. However, in these papers \(\delta \) depends on the diameter of the domain of f, assumed to be bounded. Our inexact oracle is more general and doesn’t require boundedness of the domain of f, i.e., in our case \(\delta = \Delta \), while in [7, 9], \(\delta = 2\Delta D\), where D is the diameter of the domain of f. Hence, our definition is more natural in this setting.

Example 2

(Computations at shifted points) Let F be differentiable with Lipschitz continuous gradient with constant \(L_F\) over \(\text {dom}f\). For any \(x\in \text {dom}f\) we assume we can compute the exact value of the gradient, albeit evaluated at a shifted point \(\bar{x}\), different from x and satisfying \(\Vert x - \bar{x}\Vert \le \Delta \). Then, F is equipped with a \((\delta ,L)\)-oracle of degree \(q=1\) as in Definition 2, with \(g_{\delta ,L,1}(x) = \nabla F(\bar{x})\), \(L = L_F\) and \(\delta = L_F \Delta \). Indeed, since F is \(L_F\) smooth, we have:

$$\begin{aligned} F(y)&\le F(x) + \langle \nabla F(x),y-x \rangle + \frac{L_F}{2} \Vert y - x\Vert ^2,\\&= F(x) + \langle \nabla F(\bar{x}),y-x \rangle + \langle \nabla F(x) - \nabla F(\bar{x}),y-x \rangle + \frac{L_F}{2}\Vert y - x\Vert ^2,\\&\le F(x) + \langle \nabla F(\bar{x}),y-x \rangle + \frac{L_F}{2}\Vert y - x\Vert ^2 + L_F\Vert x - \bar{x}\Vert \Vert y - x\Vert , \end{aligned}$$

where the second inequality follows from the Cauchy–Schwarz inequality and the Lipschitz continuity of \(\nabla F\). This proves our statement.

Remark 2

This example was also considered in [7, 9], with the corresponding \((\delta ,L)\)-oracle having \(\delta = L_F\Delta ^2 \), \(L = 2L_F\) and \(q=0\). Note that our L in Definition 2 is half of the corresponding L in [7, 9].

Example 3

(Accuracy measures for approximate solutions) Let us consider an \(L_F\)-smooth function F given by:

$$\begin{aligned} F(x) = \max _{u\in U} \psi (x,u):= \max _{u\in U} G(u) + \langle Au,x\rangle , \end{aligned}$$

where \(A:\mathbb {E}\rightarrow \mathbb {E}^*\) is a linear operator and \(G(\cdot )\) is a differentiable strongly concave function with concavity parameter \(\kappa >0\). Under these assumptions, the maximization problem \(\max _{u\in U} \psi (x,u)\) has a unique optimal solution \(u^{*}(x)\) for a given x. Moreover, F is convex and smooth with Lipschitz continuous gradient \(\nabla F(x) = \nabla _x \psi (x,u^{*}(x)) = Au^{*}(x)\) having Lipschitz constant \(L_F = \frac{1}{\kappa }\Vert A\Vert ^2\) [7]. Suppose that for any \(x\in \text {dom}f\), one can compute \(u_x\), an approximate maximizer of \(\psi (x,u)\), such that \(\Vert u^{*}(x) - u_x\Vert \le \Delta \). Then, F is equipped with a \((\delta ,L)\)-oracle of degree \(q=1\) with \(\delta = \Delta \Vert A\Vert \), \(L = L_F\) and \(g_{\delta , L, 1}(x) = Au_x\). Indeed, since F has Lipschitz continuous gradient, we have:

$$\begin{aligned} F(y)&\le F(x) + \langle \nabla F(x),y - x \rangle + \frac{L_F}{2}\Vert y - x \Vert ^2,\\&= F(x) + \langle \nabla _x \psi (x,u^{*}(x)), y - x \rangle + \frac{L_F}{2}\Vert y - x\Vert ^2,\\&= F(x) + \langle Au^{*}(x), y - x \rangle + \frac{L_F}{2}\Vert y - x\Vert ^2,\\&= F(x) + \langle Au_x ,y - x \rangle + \langle A(u^{*}(x) - u_x), y - x \rangle + \frac{L_F}{2}\Vert y - x\Vert ^2,\\&\le F(x) + \langle Au_x ,y - x \rangle + \Vert A\Vert \Vert u^{*}(x) - u_x\Vert \Vert y - x\Vert + \frac{L_F}{2}\Vert y - x\Vert ^2,\\&\le F(x) + \langle Au_x ,y - x \rangle + \Delta \Vert A\Vert \Vert y - x\Vert + \frac{L_F}{2}\Vert y - x\Vert ^2. \end{aligned}$$

Hence, our statement follows.
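A small Python sketch of this construction, under the simplifying assumptions \(U=\mathbb {R}^m\) and \(G(u) = -\frac{\kappa }{2}\Vert u\Vert ^2 + c^{T}u\), so that \(u^{*}(x) = (c + A^{T}x)/\kappa \) is available in closed form; the inexact oracle \(Au_x\) is obtained by stopping an inner gradient ascent early (the data and the inner solver are only illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, kappa = 15, 25, 2.0
A = rng.standard_normal((n, m))
c = rng.standard_normal(m)

def u_star(x):
    """Exact maximizer of psi(x, u) = -0.5*kappa*||u||^2 + c^T u + (A u)^T x."""
    return (c + A.T @ x) / kappa

def u_approx(x, n_steps, step=0.3):
    """Inexact maximizer: a few gradient-ascent steps on psi(x, .) from u = 0."""
    u = np.zeros(m)
    for _ in range(n_steps):
        u = u + (step / kappa) * (c + A.T @ x - kappa * u)
    return u

x = rng.standard_normal(n)
normA = np.linalg.norm(A, 2)
for n_steps in (1, 3, 10):
    ux = u_approx(x, n_steps)
    Delta = np.linalg.norm(u_star(x) - ux)
    grad_err = np.linalg.norm(A @ (u_star(x) - ux))          # error of g = A u_x vs. grad F(x)
    print(f"{n_steps:2d} inner steps: Delta = {Delta:.2e}, "
          f"gradient error = {grad_err:.2e} <= ||A||*Delta = {normA * Delta:.2e}")
```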

Remark 3

This example was also considered in [7] with the corresponding \((\delta ,L)\)-oracle having \(\delta = \Delta \), \(L = 2L_F\) and \(q=0\), while in our case, we have \(\delta = \Delta \Vert A\Vert \), \(L = L_F\) and \(q = 1\).

Example 4

(Weak level of smoothness) Let F be a proper lower semicontinuous function with the subdifferential \(\partial F(x) \) nonempty for all \(x\in \text {dom}f\). Assume that F satisfies the following Hölder condition with \(H_{\nu }<\infty \):

$$\begin{aligned} \Vert g(x) - g(y)\Vert \le H_{\nu }\Vert y-x\Vert ^\nu , \end{aligned}$$
(8)

for all \(g(x)\in \partial F(x)\), \(g(y)\in \partial F(y)\), where \(x,y\in \text {dom}\,f\) and \(\nu \in [0,1]\). Then, F is equipped with \((\delta ,L)\)-oracle of degree q as in Definition 2, with \(g_{\delta ,L,q}(x) \in \partial F(x)\), for any arbitrary degree \(0 \le q<1+\nu \) and any accuracy \(\delta >0\), and a constant L depending on \(\delta \) given by:

$$\begin{aligned}L(\delta ) = \frac{2(1+\nu -q)}{2-q}\left( \frac{H_{\nu }}{1+\nu } \right) ^{\frac{2-q}{1+\nu -q}}\left( \frac{1-\nu }{\delta (2-q)} \right) ^{\frac{1-\nu }{1+\nu - q}}.\end{aligned}$$

Indeed, from the Hölder condition we have [14]:

$$\begin{aligned} F(x) - F(y) - \langle g(y),x-y \rangle \le \frac{H_{\nu }}{1 + \nu } \Vert x-y\Vert ^{1+\nu }. \end{aligned}$$

For any given \(\delta >0\), we compute \(L(\delta )\) such that the following inequality holds:

$$\begin{aligned} \frac{H_{\nu }}{1 + \nu }\Vert x-y\Vert ^{1+\nu }\le \frac{L(\delta )}{2} \Vert x-y\Vert ^2 + \delta \Vert x-y\Vert ^q. \end{aligned}$$

Denote \(r=\Vert x-y\Vert \) and let \(\lambda \in (0,1)\). Using the weighted arithmetic–geometric mean inequality with \(\alpha _1 = \lambda \) and \(\alpha _2 = 1-\lambda \), we have:

$$\begin{aligned} \frac{L(\delta )r^2}{2} + \delta r^q&=\lambda \frac{L(\delta )}{2\lambda } r^2 + (1-\lambda )\frac{\delta }{1-\lambda } r^q\\&\ge \left( \frac{L(\delta )}{2\lambda } r^2\right) ^{\lambda } \left( \frac{\delta }{1 \!-\! \lambda }r^q \right) ^{1 \!-\! \lambda } \!=\! \left( \frac{L(\delta )}{2\lambda }\right) ^{\lambda } \left( \frac{\delta }{1 \!-\! \lambda }\right) ^{1 \!-\! \lambda } \!\! r^{2\lambda + q(1 \!-\! \lambda )}.\\ \end{aligned}$$

Thus \(\frac{H_{\nu }}{1+\nu }=\left( \frac{L(\delta )}{2\lambda }\right) ^{\lambda } \left( \frac{\delta }{1-\lambda }\right) ^{1-\lambda }\) and \(1 + \nu =2\lambda + q(1-\lambda )\). It follows that \(\lambda = \frac{1+\nu -q}{2-q}\), \(1-\lambda = \frac{1-\nu }{2-q}\) and \(\frac{1}{\lambda } - 1 = \frac{1-\nu }{1+\nu - q}\). Hence, for a given positive \(\delta \) one may choose:

$$\begin{aligned} L(\delta ) \!=\! 2\lambda \left( \frac{H_{\nu }}{1 \!+\! \nu }\right) ^{\frac{1}{\lambda }}\left( \frac{1 \!-\! \lambda }{\delta }\right) ^{\frac{1}{\lambda } \!-\! 1} \!=\! \frac{2(1 \!+\! \nu \!-\! q)}{2 \!-\! q}\left( \frac{H_{\nu }}{1 \!+\! \nu }\right) ^{\frac{2 \!-\! q}{1 \!+\! \nu \!-\! q}}\left( \frac{1 \!-\! \nu }{\delta (2 \!-\! q)}\right) ^{\frac{1 \!-\! \nu }{1 \!+\! \nu \!-\! q}}, \end{aligned}$$

and this proves our statement. Note that if \(\nu > 0\), then we have \(\partial F(x) = \lbrace \nabla F(x) \rbrace \) for all x and thus F is differentiable. Indeed, taking \(y=x\) in (8) shows that any two subgradients \(g(x), \bar{g}(x)\in \partial F(x)\) coincide. This implies that the set \(\partial F(x)\) has a single element, thus F is differentiable. This example covers large classes of functions. Indeed, when \(\nu = 1\), we get functions with Lipschitz continuous (sub)gradient. For \(\nu < 1\), we get a weaker level of smoothness. In particular, when \(\nu = 0\), we obtain functions whose subgradients have bounded variation. Clearly, the latter class includes functions whose subgradients are uniformly bounded by M (just take \(H_0 = 2M\)). It also covers functions smoothed by local averaging and Moreau–Yosida regularization (see [7] for more details). We believe that the readers may find other examples that satisfy our Definition 2 of an inexact first-order oracle of degree q.
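As a quick consistency check of the choice of \(L(\delta )\) above, take \(\nu = 1\) (Lipschitz continuous gradient), so that \(\lambda = 1\) and the last factor, having exponent \(\frac{1}{\lambda }-1 = 0\), equals one. Then, for any \(q\in [0,2)\) and any \(\delta >0\):

$$\begin{aligned} L(\delta ) = 2\lambda \left( \frac{H_{1}}{2}\right) ^{\frac{1}{\lambda }} = H_{1}, \end{aligned}$$

which is the smallest constant for which \(\frac{H_1}{2}\Vert x-y\Vert ^{2}\le \frac{L(\delta )}{2}\Vert x-y\Vert ^2 + \delta \Vert x-y\Vert ^q\) can hold for all x, y (let \(\Vert x-y\Vert \rightarrow \infty \)), so in this case (5) reduces to the exact smoothness bound (3).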

4 Inexact proximal gradient method

In this section, we introduce an inexact proximal gradient method based on the previous inexact oracle definition for solving (non)convex composite minimization problems (1). We derive complexity estimates for this algorithm and study the dependence between the accuracy of the oracle and the desired accuracy of the gradient or of the objective function. Hence, we consider the following Inexact Proximal Gradient Method (I-PGM).

Algorithm 1 Inexact proximal gradient method (I-PGM): given \(x_0 \in \text {dom}\, f\) and step sizes \(\alpha _k > 0\), for \(k \ge 0\) compute \(g_{\delta _k,L_k,q}(x_k)\) and update \(x_{k+1} = \text {prox}_{\alpha _k h}\big (x_k - \alpha _k\, g_{\delta _k,L_k,q}(x_k)\big )\).

Note that Algorithm 1 is an inexact proximal gradient method, where the inexactness comes from the approximate computation of the (sub)gradient of F, denoted \(g_{\delta _k,L_k,q}(x_k)\). In the next sections we analyze the convergence behavior of this algorithm when \(g_{\delta _k,L_k,q}(x_k)\) satisfies Definition 2.
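For concreteness, a minimal Python sketch of I-PGM written for a generic, user-supplied inexact oracle and prox (the callables and the stopping rule are left as assumptions of the user; this is an illustration, not a full implementation):

```python
import numpy as np

def ipgm(x0, inexact_grad, prox_h, alpha, n_iters):
    """Inexact proximal gradient method (I-PGM), Algorithm 1.

    inexact_grad(k, x) returns g_{delta_k, L_k, q}(x);  prox_h(x, a) returns prox_{a*h}(x);
    alpha(k) returns the step size alpha_k <= 1/(L_k + q*rho).
    """
    x = np.asarray(x0, dtype=float).copy()
    grad_map_norms = []
    for k in range(n_iters):
        g = inexact_grad(k, x)
        x_next = prox_h(x - alpha(k) * g, alpha(k))
        # gradient mapping: g_k + p_{k+1} = -(x_{k+1} - x_k)/alpha_k (prox optimality condition)
        grad_map_norms.append(np.linalg.norm(x_next - x) / alpha(k))
        x = x_next
    return x, grad_map_norms
```

For instance, with the noisy gradient of Example 1 and \(h = \lambda \Vert \cdot \Vert _1\), one would pass the perturbed gradient as inexact_grad and the soft-thresholding operator sketched in Section 2 as prox_h.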

4.1 Nonconvex convergence analysis

In this section we consider a nonconvex function F that admits an inexact first-order \((\delta ,L)\)-oracle of degree q as in Definition 2. Using this definition and inequality (4), for all \(\rho >0\) we get the following upper bound:

$$\begin{aligned} F(x) - \Big (F(y) + \langle g_{\delta ,L,q}(y),x-y \rangle \Big )\le \frac{L + q\rho }{2}\Vert x-y\Vert ^{2} + \frac{(2-q)\delta ^{\frac{2}{2-q}}}{2\rho ^{\frac{q}{2-q}}}. \end{aligned}$$
(9)

This inequality will play a key role in our convergence analysis. We define the gradient mapping at iteration k as \(g_{\delta _k,L_k,q}(x_k) + p_{k+1}\), where \(p_{k+1}\in \partial h(x_{k+1})\) such that \( g_k + p_{k+1} = -\frac{1}{\alpha _k}(x_{k+1} - x_k)\) (i.e., \(p_{k+1}\) is the subgradient of h at \(x_{k+1}\) coming from the optimality condition of the prox at \(x_k\)). Next we analyze the global convergence of I-PGM in the norm of the gradient mapping. We have the following theorem:

Theorem 1

Let F be a nonconvex function admitting a \((\delta _k,L_k)\)-oracle of degree \(q\in [0,2)\) at each iteration k, with \(\delta _k \ge 0\) and \(L_k > 0\) for all \(k\ge 0\). Let \((x_k)_{k\ge 0}\) be generated by I-PGM and assume that \(\alpha _k \le \frac{1}{L_k + q\rho }\), for some arbitrary parameter \(\rho >0\). Then, there exists \(p_{k+1} \in \partial h(x_{k+1})\) such that:

$$\begin{aligned} \sum _{j=0}^{k}\frac{\alpha _j}{2} \Vert g_{\delta _{j},L_{j},q}(x_j) + p_{j+1} \Vert ^2\le f(x_0) - f_\infty + \frac{\sum _{j=0}^{k}(2-q)\delta _j^{\frac{2}{2-q}}}{2\rho ^{\frac{q}{2-q}}}. \end{aligned}$$
(10)

Proof

Denote \(g_{\delta _k,L_k,q}(x_k)= g_k\). From the optimality conditions of the proximal operator defining \(x_{k+1}\), we have:

$$\begin{aligned} g_k + p_{k+1} = -\frac{1}{\alpha _k} (x_{k+1} - x_k) . \end{aligned}$$

Further, from inequality (9), we get:

$$\begin{aligned}&F(x_{k+1}) \le F(x_k) + \langle g_k,x_{k+1} - x_k \rangle + \frac{L_k + q\rho }{2}\Vert x_{k+1} - x_k\Vert ^2 + \frac{(2-q) \delta _k^{\frac{2}{2-q}}}{2\rho ^{\frac{q}{2-q}}}\\&= F(x_k) \!+\! \langle g_k \!+\! p_{k+1},x_{k+1} \!-\! x_k \rangle \!-\! \langle p_{k+1},x_{k+1} \!-\! x_k \rangle \!+\! \frac{L_k + q\rho }{2}\Vert x_{k+1} \!-\! x_k\Vert ^2\\&\quad + \frac{(2-q)\delta _k^{\frac{2}{2\!-\!q}}}{2\rho ^{\frac{q}{2-q}}}\\&\le F(x_k) \!-\!\alpha _k \left( 1 - \frac{(L_k \!+\! q\rho ) \alpha _k }{2}\right) \Vert g_k \!+\! p_{k+1} \Vert ^2 \!+\! h(x_k) \!-\! h(x_{k+1}) \!+\! \frac{(2\!-\!q) \delta _k^{\frac{2}{2-q}}}{2\rho ^{\frac{q}{2-q}}}\\&\le F(x_k) -\frac{\alpha _k}{2} \Vert g_k + p_{k+1}\Vert ^2 + h(x_k) - h(x_{k+1}) + \frac{(2\!-\!q)\delta _k^{\frac{2}{2-q}}}{2\rho ^{\frac{q}{2-q}}}, \end{aligned}$$

where the second inequality follows from the convexity of h and \(p_{k+1}\in \partial h(x_{k+1})\), and the last inequality follows from the choice \(\alpha _k \le \frac{1}{L_k + q\rho }\). Hence, we get that:

$$\begin{aligned} f(x_{k+1})\le f(x_k) - \frac{\alpha _k}{2} \Vert g_k \!+\! p_{k+1}\Vert ^2 + \frac{(2-q) \delta _k^{\frac{2}{2-q}}}{2\rho ^{\frac{q}{2-q}}}. \end{aligned}$$

Summing up this inequality from \(j=0\) to \(j=k\) and using the fact that \(f(x_{k+1}) \ge f_\infty \), where recall that \(f_\infty \) denotes a finite lower bound for the objective function, we get:

$$\begin{aligned} \sum _{j=0}^{k} \frac{\alpha _j}{2} \Vert g_j + p_{j+1}\Vert ^2&\le f(x_0) - f(x_{k+1}) + \frac{\sum _{j=0}^{k}(2-q) \delta _j^{\frac{2}{2-q}}}{2\rho ^{\frac{q}{2-q}}}\\&\le f(x_0) - f_\infty + \frac{\sum _{j=0}^{k}(2-q)\delta _j^{\frac{2}{2-q}}}{2\rho ^{\frac{q}{2-q}}}. \end{aligned}$$

Hence, our statement follows. \(\square \)

For a particular choice of the algorithm parameters, we can get simpler convergence estimates.

Theorem 2

Let the assumptions of Theorem 1 hold and consider for all \(k \ge 0\):

$$\begin{aligned} L_k = L,\; \delta _k = \frac{\delta }{(k+1)^{\frac{\beta (2-q)}{2}}}, \;\alpha _k = \frac{1}{(L + q\rho )(k+1)^{\zeta }}, \; where \; \beta , \zeta \in [0,1). \end{aligned}$$

Then, we have:

$$\begin{aligned}&\min _{j=0:k} \Vert g_j + p_{j+1}\Vert ^2 \le \frac{2(L \!+\! q\rho )(f(x_0) \!-\! f_\infty )}{(1 \!-\! \zeta )(k+1)^{1-\zeta }} \!+\! \frac{(2\!-\!q)(L \!+\! q\rho )\delta ^{\frac{2}{2-q}} }{(1-\zeta )(1-\beta )\rho ^{\frac{q}{2-q}} (k+1)^{\beta -\zeta }}. \end{aligned}$$
(11)

Proof

Taking the minimum in the inequality (10), we get:

$$\begin{aligned} \min _{j=0:k} \Vert g_j + p_{j+1}\Vert ^2&\le \frac{2(f(x_0) - f_\infty )}{\sum _{j=0}^{k} \alpha _j} + \frac{\sum _{j=0}^{k} (2-q)\delta _j^{\frac{2}{2-q}}}{\rho ^{\frac{q}{2-q}} \sum _{j=0}^{k}\alpha _j}. \end{aligned}$$

Further, since we have:

$$\begin{aligned} \sum _{j=0}^{k} \frac{1}{(L+q\rho )(j+1)^\zeta } = \sum _{j=1}^{k+1} \frac{1}{(L+q\rho )j^\zeta }, \end{aligned}$$

and similarly for \(\delta _j\), we get:

$$\begin{aligned} \min _{j=0:k} \Vert g_j + p_{j+1}\Vert ^2&\le \frac{2(L + q\rho ) (f(x_0) - f_\infty )}{\sum _{j=1}^{k+1} \frac{1}{j^\zeta }} + \frac{(2 - q)(L + q\rho ) \delta ^{\frac{2}{2-q}} \sum _{j=1}^{k+1} \frac{1}{j^\beta }}{\rho ^{\frac{q}{2-q}} \sum _{j=1}^{k+1} \frac{1}{j^\zeta }}. \end{aligned}$$

Since \(0 \le \zeta <1\), we have for all \(k\ge 0\):

$$\begin{aligned} (1-\zeta )(k+1)^{1-\zeta }\le \frac{(k+2)^{1-\zeta } - 1}{1-\zeta }&= \int _{1}^{k+2}\frac{1}{u^{\zeta }} du\\&\le \sum _{j=1}^{k+1} \frac{1}{j^\zeta } \le \int _{1}^{k+1} \left( \frac{1}{u^\zeta } \right) du + 1 \le \frac{(k+1)^{1-\zeta }}{1-\zeta }. \end{aligned}$$

It follows that for all \(k\ge 0\):

$$\begin{aligned} \min _{j=0:k} \Vert g_j + p_{j+1}\Vert ^2&\le \frac{2(L \!+\! q\rho )(f(x_0) \!-\! f_\infty )}{(1 \!-\! \zeta ) (k+1)^{1-\zeta }} +\frac{(2\!-\!q)(L \!+\! q\rho ) \delta ^{\frac{2}{2-q}} }{(1-\zeta )(1-\beta )\rho ^{\frac{q}{2-q}} (k+1)^{\beta -\zeta }}. \end{aligned}$$

Hence, our statement follows. \(\square \)

Let us analyze the bound from Theorem 2 in more detail. For simplicity, consider the case \(q = 1\) (see Example 1). Then, we have:

$$\begin{aligned} \min _{j=0:k} \Vert g_j + p_{j+1}\Vert ^2&\le \frac{2(L + \rho )(f(x_0) - f_\infty )}{(1-\zeta )(k+1)^{1 - \zeta }} + \frac{(L + \rho ) \delta ^2}{\rho (1-\zeta )(1-\beta ) (k+1)^{\beta -\zeta }}\\&= \frac{2L(f(x_0) - f_\infty )}{(1-\zeta )(k+1)^{1 - \zeta }} + \frac{2\rho (f(x_0) - f_\infty )}{(1-\zeta )(k+1)^{1 - \zeta }} \\&\quad + \frac{L \delta ^2}{\rho (1-\zeta )(1-\beta )(k+1)^{\beta -\zeta }} + \frac{\delta ^2}{ (1-\zeta )(1-\beta )(k+1)^{\beta -\zeta }}. \end{aligned}$$

Denote \(\Delta _0:= f(x_0) - f_\infty \). Since parameter \(\rho > 0\) is a degree of freedom, minimizing the right hand side of the previous relation w.r.t. \(\rho \) we get an optimal choice \(\rho = \frac{\delta \sqrt{L}}{\sqrt{2\Delta _0(1-\beta )}}(k+1)^{\frac{1 - \beta }{2}}\). Hence, replacing this expression for \(\rho \) in the last inequality, we get:

$$\begin{aligned} \min _{j=0:k} \Vert g_j + p_{j+1}\Vert ^2&\le \frac{2L\Delta _0}{(1-\zeta ) (k+1)^{1-\zeta }} + \frac{2\delta \sqrt{2L\Delta _0}}{((1-\zeta ) \sqrt{1-\beta })(k+1)^{\frac{1+\beta }{2} - \zeta }}\\&+ \frac{\delta ^2}{(1-\zeta )(1-\beta )(k+1)^{\beta -\zeta }}. \end{aligned}$$

This bound is of order \(\mathcal {O}\left( \frac{1}{k^{1-\zeta }} + \frac{\delta }{k^{\frac{1+\beta }{2} - \zeta }} + \frac{\delta ^2}{k^{\beta -\zeta }}\right) \). Note that, if \(\beta > \zeta \), the gradient mapping \(\min _{j=0:k} \Vert g_j + p_{j+1}\Vert ^2\) converges to zero regardless of the accuracy of the oracle \(\delta \), and the convergence rate is of order \(\mathcal {O}(k^{-\text {min}(1-\zeta ,\beta - \zeta )})\) (since we always have \(\frac{1+\beta }{2} - \zeta \ge \beta - \zeta \)). Note that this is not the case for \(q=0\), where the convergence rate is of order \(\mathcal {O}\left( \frac{1}{k} + \delta \right) \), see also [9]. The following corollary provides a convergence rate for general q, but for a particular choice of the parameters \(\zeta \) and \(\beta \).

Corollary 1

Let the assumptions of Theorem 2 hold and assume that \(\zeta = \beta = 0\). Then, we have the following convergence rates:

  1.

    If \(0\le q < 2\) and \(\rho = L\), then \(\delta _k = \delta \), \(\alpha _k = \frac{1}{L+qL}\) and

    $$\begin{aligned}&\min _{j=0:k} \Vert g_j + p_{j+1}\Vert ^2 \le \frac{2(q+1)L \Delta _0}{k+1} +(q+1)(2-q)L^{\frac{2-2q}{2-q}} \delta ^{\frac{2}{2-q}} \quad \forall k \ge 0. \end{aligned}$$
  2.

    If \(1\le q < 2\), fixing the number of iterations k and taking \(\rho = \frac{L^{\frac{2-q}{2}}\delta }{(2\Delta _0)^{\frac{2-q}{2}}}(k+1)^{\frac{2-q}{2}}\), then \(\delta _j = \delta \), \(\alpha _j =\frac{1}{L + q\rho }\) for all \(j=0:k\) and

    $$\begin{aligned}&\min _{j=0:k} \Vert g_j + p_{j+1}\Vert ^2\\&\le \frac{2L\Delta _0}{k+1}\! +\! \frac{L^{\frac{2-q}{2}} (2\Delta _0)^{\frac{q}{2}}\delta \! +\! (2-q)\delta L^{1-\frac{q}{2}} (2\Delta _0)^{\frac{q}{2}}}{(k+1)^{\frac{q}{2}}}\!+\! \frac{q(2-q) \delta ^2 L^{1-q}(2\Delta _0)^{q-1}}{(k+1)^{q-1}}. \end{aligned}$$

Proof

Replacing \(\zeta = \beta = 0\) in inequality (11), we get:

$$\begin{aligned} \min _{j=0:k} \Vert g_j + p_{j+1}\Vert ^2&\le \frac{2(L + q\rho ) \Delta _0}{k+1} + \frac{(2-q)(L + q\rho )\delta ^{\frac{2}{2-q}}}{\rho ^{\frac{q}{2-q}}}\\&= \frac{2L\Delta _0}{k+1} + \frac{2q\rho \Delta _0}{k+1} + \frac{(2-q)L\delta ^{\frac{2}{2-q}}}{\rho ^{\frac{q}{2-q}}} + \frac{q(2-q) \delta ^{\frac{2}{2-q}}}{\rho ^{\frac{2q-2}{2-q}}}. \end{aligned}$$

If \(0\le q <2\), then taking \(\rho = L \) in the last inequality we get the first statement. Further, if \(1\le q <2\), minimizing over \(\rho \) the second and the third terms of the right side of the last inequality yields the optimal choice \(\rho = \frac{L^{\frac{2-q}{2}}\delta }{(2\Delta _0)^{\frac{2-q}{2}}}(k+1)^{\frac{2-q}{2}}\). Replacing this expression for \(\rho \) in the last inequality, we get:

$$\begin{aligned}&\min _{j=0:k} \Vert g_j + p_{j+1}\Vert ^2\\&\le \frac{2L\Delta _0}{k+1}\! +\! \frac{L^{\frac{2-q}{2}} (2\Delta _0)^{\frac{q}{2}}\delta \! +\! (2-q)\delta L^{1-\frac{q}{2}} (2\Delta _0)^{\frac{q}{2}}}{(k+1)^{\frac{q}{2}}}\!+\! \frac{q(2-q) \delta ^2 L^{1-q}(2\Delta _0)^{q-1}}{(k+1)^{q-1}}, \end{aligned}$$

and this is the second statement. \(\square \)

Remark 4

Let us analyze this convergence rate in more detail for Example 1. For \(q = 0\), we have that \(\delta = 2D\Delta \) and \(L = L_F\), where D is the diameter of \(\text {dom}f\). Hence, the convergence rate in this case becomes:

$$\begin{aligned} \min _{j=0:k} \Vert g_j + p_{j+1}\Vert ^2 \le \frac{4L_F \Delta _0}{k+1} + 4DL_F\Delta . \end{aligned}$$

On the other hand, for \(q=1\), we have \(\delta = \Delta \) and \(L = L_F\). Thus, we get the following convergence rate:

$$\begin{aligned}&\min _{j=0:k} \Vert g_j + p_{j+1}\Vert ^2 \le \frac{4 L_F\Delta _0}{k+1} +2\Delta ^2. \end{aligned}$$

Hence, if we want to achieve \(\min _{j=0:k} \Vert g_j + p_{j+1}\Vert ^2 \le \epsilon \), for \(q=0\) we impose \( 4DL_F\Delta \le \epsilon /2\), which implies that one needs to compute an approximate gradient with accuracy \(\Delta = \mathcal {O}(\epsilon )\), while for \(q=1\) we impose \( 2\Delta ^2 \le \epsilon /2\), meaning that one only needs to compute an approximate gradient with accuracy \( \Delta = \mathcal {O}(\epsilon ^{1/2})\). Hence, for this example, it is more natural to use our inexact first-order oracle definition for \(q=1\) than for \(q=0\), since it requires less accuracy for approximating the true gradient.

Note that in the second result of Corollary 1, the parameter \(\rho \) depends on the difference \(\Delta _0 = f(x_0) - f_\infty \), and, usually, \(f_\infty \) is unknown. In practice, we can approximate \(\Delta _0\) by using an estimate for \(f_\infty \) in place of its exact value. For example, one can consider \(\Delta _0^k = f(x_0) - f_\text {best}^k\), where \(f_\text {best}^k = \min _{j=0:k} f(x_j) - \varepsilon _k\) for some \(\varepsilon _k \ge 0\), see [10]. Under this setting, the sequence \(\varepsilon _k\) and the iterates of I-PGM corresponding to the case of the second result of Corollary 1 are updated as follows:

Algorithm 2 Adaptive I-PGM algorithm when \(f_\infty \) is unknown
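One possible realization of this adaptive step is sketched below in Python; the doubling rule for \(\varepsilon _k\) and the initial value eps0 are our assumptions (any update that eventually makes \(\varepsilon _k \ge \min _{j=0:k}f(x_j) - f_\infty \) while keeping \(\varepsilon _k \le 2(\min _{j=0:k}f(x_j) - f_\infty )\) would do), and the step size follows the second result of Corollary 1 with \(\Delta _0^k\) in place of \(\Delta _0\):

```python
import numpy as np

def adaptive_ipgm_step(f, prox_h, inexact_grad, x_hist, L, q, delta, k, eps, eps0=1e-6):
    """One iteration of a possible adaptive I-PGM variant (Algorithm 2), hypothetical realization."""
    x = x_hist[-1]
    eps = max(eps, eps0)
    while True:
        f_best = min(f(xj) for xj in x_hist) - eps            # estimate f_best^k of f_infty
        Delta0 = f(x_hist[0]) - f_best                        # estimate of f(x_0) - f_infty
        rho = (L * (k + 1) / (2 * Delta0)) ** ((2 - q) / 2) * delta
        alpha = 1.0 / (L + q * rho)
        x_next = prox_h(x - alpha * inexact_grad(x), alpha)
        if f(x_next) >= f_best:                               # consistency test from the text
            return x_next, eps
        eps *= 2.0                                            # assumed update rule
```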

This process is well defined, i.e., the “while” step finishes in a finite number of iterations. Indeed, one can observe that if \(\varepsilon _k \ge \min _{j=0:k} f(x_j) - f_\infty \) then \(\varepsilon _k \ge \min _{j=0:k} f(x_j) - f(x_{k+1})\), which implies that \(f(x_{k+1}) \ge f_\text {best}^k\). Additionally, we have \(\varepsilon _{k} \le 2(\min _{j=0:k} f(x_j) - f_\infty )\) for all \(k\ge 0\). Hence, we can still derive a convergence rate for the second result of Corollary 1 using this adaptive process since one can observe that:

$$\begin{aligned} f(x_0) - f(x_{k+1})&\le f(x_0) - f_\text {best}^k = \Delta _0^k. \end{aligned}$$

Additionally, we have the following bound on \(\Delta _0^k\):

$$\begin{aligned} \Delta _0^k&\le f(x_0) - \min \limits _{j=0:k}f(x_j) + 2(\min _{j=0:k} f(x_j) - f_\infty )\\&= (f(x_0) - f_\infty ) + (\min \limits _{j=0:k}f(x_j) - f_\infty ). \end{aligned}$$

Hence, we can replace in (10) the difference \(\Delta _0 = f(x_0) - f_\infty \) with \(\Delta _0^k\) and then the second statement of Corollary 1 remains valid with \(\Delta _0^k\) instead of \(\Delta _0\).

Remark 5

We observe that for \(q=0\) we recover the same convergence rate as in [9]. However, our result does not require the boundedness of the domain of f, while in [9] the rate depends explicitly on the diameter of the domain of f. Moreover, for \(q >0\) our convergence bounds are better than in [9], i.e., the coefficients of the terms in \(\delta \) are either smaller or even tend to zero, while in [9] they are always constant.

Further, let us consider the case of Example 4, where F satisfies the Hölder condition with exponent \(\nu \in (0,1]\), and take \(\beta =\zeta =0\). We have shown that for any \(\delta >0 \) this class of functions can be equipped with a \((\delta ,L)\)-oracle of degree \(q< 1+\nu \) with \(L= C(H_{\nu },q)\left( \frac{1}{\delta }\right) ^{\frac{1-\nu }{1+\nu - q}}\) (see Example 4 for the expression of the constant \(C(H_\nu ,q)\)). In view of the first result of Corollary 1, after k iterations, we have:

$$\begin{aligned} \min _{j=0:k} \Vert g_j + p_{j+1}\Vert ^2&\le \frac{2(q+1)\Delta _0 L}{k+1} +(q+1)(2-q)L^{\frac{2-2q}{2-q}}\delta ^{\frac{2}{2-q}}\\&= \frac{C_1}{k+1} \left( \frac{1}{\delta }\right) ^{\frac{1-\nu }{1+\nu -q}} + C_2\left( \frac{1}{\delta }\right) ^{\frac{(1-\nu )(2-2q)}{(1+\nu -q)(2-q)}}\delta ^{\frac{2}{2-q}}\\&= \frac{C_1}{k+1}\delta ^{-\frac{1-\nu }{1+\nu -q}} + C_2 \delta ^{-\frac{(1-\nu )(2-2q)}{(1+\nu -q)(2-q)} + \frac{2}{2-q}}\\&= \frac{C_1}{k+1}\delta ^{-\frac{1-\nu }{1+\nu -q}} + C_2 \delta ^{\frac{2\nu }{1+\nu -q}}, \end{aligned}$$

where \(C_1: = 2(q+1)\Delta _0 C(H_\nu ,q)\) and \(C_2 = (q+1)(2-q)C(H_\nu ,q)^{\frac{2-2q}{2-q}}\). Since in this example we can choose \(\delta \), its optimal value can be computed from the following equation:

$$\begin{aligned} -\frac{C_1(1 - \nu )}{(1 + \nu - q)}\frac{1}{(k+1)}\delta ^{\frac{q-2}{1 + \nu - q}} + \frac{2\nu C_2}{1+\nu -q}\delta ^{\frac{-1 + \nu + q}{1 + \nu - q}} = 0. \end{aligned}$$

Hence, we get:

$$\begin{aligned} \delta = C_3 (k+1)^{-\frac{1+\nu -q}{1+ \nu }}, \end{aligned}$$

where \(C_3 = \left( \frac{2\nu C_2}{(1-\nu )C_1}\right) ^{-\frac{1+\nu -q}{1+ \nu }}\). Thus, replacing this optimal choice of \(\delta \) in the last inequality, we get:

$$\begin{aligned} \min _{j=0:k} \Vert g_j + p_{j+1}\Vert ^2&\le C_1C_3\left( (k+1)^{-\left( 1 - \frac{1-\nu }{1 +\nu }\right) }\right) + C_2C_3\left( (k+1)^{-\frac{2\nu }{ 1 + \nu } }\right) \\&= \frac{C_3(C_1 + C_2)}{(k+1)^{\frac{2\nu }{1 + \nu }}}. \end{aligned}$$

Remark 6

Note that our convergence rate of order \(\mathcal {O}(k^{-\frac{2\nu }{1+\nu }})\) for Algorithm 1 (I-PGM) for nonconvex problems having the first term F with a Hölder continuous gradient (Example 4) recovers the rate obtained in [9] under the same settings.

Finally, let us show that when the gradient mapping is small enough, i.e., \(\Vert g_k + p_{k+1} \Vert \) is small, \(x_{k+1}\) is a good approximation of a stationary point of problem (1). Note that any choice \(\alpha _k \le \frac{1}{L + q\rho } \) yields:

$$\begin{aligned} \Vert x_{k+1} - x_{k} \Vert \le \frac{1}{L} \left\| \frac{1}{\alpha _k}(x_{k+1} - x_k) \right\| = \frac{1}{L}\Vert g_k + p_{k+1}\Vert . \end{aligned}$$

Hence, if the gradient mapping is small, then the norm of the difference \(\Vert x_{k+1} - x_k \Vert \) is also small.

Theorem 3

Let \((x_{k})_{k\ge 0}\) be generated by I-PGM and let \(p_{k+1} \in \partial h(x_{k+1})\). Assume that we are in the case of Example 1. Then, we have:

$$\begin{aligned} dist (0,\partial f(x_{k+1}))\le \Vert g_{\Delta ,L_F,q}(x_k) + p_{k+1}\Vert + L_F \Vert x_{k+1} - x_k\Vert + \Delta . \end{aligned}$$

Further, if we are in the case of Example 4, then we have:

$$\begin{aligned} dist (0,\partial f(x_{k+1}))\le \Vert g(x_k) + p_{k+1} \Vert + H_\nu \Vert x_{k+1} - x_k \Vert ^{\nu }, \;\; g(x_k)\in \partial F(x_k). \end{aligned}$$

Proof

Let us consider Example 1, where F is \(L_F\) smooth and h is convex. Since \(\nabla F(x_{k+1}) + p_{k+1} \in \partial f(x_{k+1})\), then we have:

$$\begin{aligned}&\Vert \nabla F(x_{k+1}) + p_{k+1} \Vert \\&\le \Vert g_{\Delta ,L_F,q}(x_k) + p_{k+1} \Vert + \Vert \nabla F(x_k) - g_{\Delta ,L_F,q}(x_k) \Vert + \Vert \nabla F(x_{k+1}) - \nabla F(x_k) \Vert \\&\le \Vert g_{\Delta ,L_F,q}(x_k) + p_{k+1} \Vert + \Delta + L_F \Vert x_{k+1} - x_k \Vert . \end{aligned}$$

Further, let us assume that we are in the case of Example 4. Then, we have \(g(x_k) \in \partial F(x_k)\). Further, let \( g(x_{k+1}) \in \partial F(x_{k+1})\), then we get:

$$\begin{aligned} \Vert g(x_{k+1}) + p_{k+1} \Vert&\le \Vert g(x_k) + p_{k+1} \Vert + \Vert g(x_{k+1}) - g(x_k) \Vert \\&\le \Vert g(x_k) + p_{k+1} \Vert + H_\nu \Vert x_{k+1} - x_k \Vert ^{\nu }. \end{aligned}$$

This proves our statements. \(\square \)

Thus, for \(\Vert \frac{1}{\alpha _k}(x_{k+1} - x_k) \Vert = \Vert g_k + p_{k+1}\Vert \) small, \(x_{k+1}\) is an approximate stationary point of problem (1). Note that our convergence rates from this section are better as q increases, i.e., the terms depending on \(\delta \) are smaller for \(q>0\) than for \(q=0\). In particular, the power of \(\delta \) in the convergence estimate is higher for \(q \in (0,1)\) than for \(q=0\), while for \(q \ge 1\) the coefficients of \(\delta \) even diminish with k. Hence, it is beneficial to have an inexact first-order oracle of degree \(q > 0\), as this allows us to work with less accurate approximation of the (sub)gradient of the nonconvex function F than for \(q = 0\).

4.2 Convex convergence analysis

In this section, we analyze the convergence rate of I-PGM for problem (1), where F is now assumed to be a convex function. By adding extra information to the oracle (5), we consider the following modification of Definition 2:

Definition 3

A convex function F is equipped with an inexact first-order \((\delta , L)\)-oracle of degree \(0\le q < 2\) if for any \(y \in \text {dom} f\) we can compute a vector \( g_{\delta ,L,q}(y)\) such that:

$$\begin{aligned} 0\!\le \! F(x) \!-\! \left( F(y) \!+\! \langle g_{\delta ,L,q} (y),x\!-\!y \rangle \right) \!\le \! \frac{L}{2}\Vert x \!-\! y \Vert ^{2} \!\!+\! \delta \Vert y\!-\!x \Vert ^q \;\;\; \forall x \!\in \! \text {dom} f. \end{aligned}$$
(12)

Note that Example 4 satisfies this definition. In (12), the zero-order information is considered to be exact. This is not the case in [7], which considers the particular choice \(q=0\). Further, the first-order information \(g_{\delta ,L,q}\) in (12) is a subgradient of F at y, while in [7] it is a \(\delta \)-subgradient. However, using this inexact first-order oracle of degree q, I-PGM provides better rates compared to [7]. From (12) and (4), we get:

$$\begin{aligned} 0\le F(x) \!-\! \left( F(y) \!+\! \langle g_{\delta ,L,q}(y),x \!-\! y \rangle \right) \le \frac{L \!+\! q\rho }{2}\Vert x\!-\!y\Vert ^{2} + \frac{(2-q)\delta _{q}^{\frac{2}{2-q}}}{2\rho ^{\frac{q}{2-q}}}, \end{aligned}$$
(13)

for all \(\rho >0\). Next, we analyze the convergence rate of I-PGM in the convex setting. We have the following result:

Corollary 2

Let F be a convex function admitting a \((\delta ,L)\)-oracle of degree \(q\in [0,2)\) (see Definition 3). Let \((x_k)_{k\ge 0}\) be generated by I-PGM and assume that \(\alpha _k = \frac{1}{L+q\rho }\), with \(\rho >0\). Define \(\hat{x}_k = \frac{\sum _{i=0}^{k}x_{i+1}}{k+1} \) and \(R = \Vert x_0 - x^* \Vert \). Then, we have:

$$\begin{aligned} f(\hat{x}_k) - f^* \le \frac{(L + q\rho )R^2}{2k} + \frac{(2-q)\delta ^{\frac{2}{2-q}}}{2\rho ^{\frac{q}{2-q}}}. \end{aligned}$$
(14)

Proof

Follows from (13) and Theorem 2 in [7]. \(\square \)

Since we have the freedom of choosing \(\rho \), let us minimize the right hand side of (14) over \(\rho \). Then, \(\rho \) must satisfy \(\frac{qR^2}{2k} - \frac{q \delta ^{\frac{2}{2-q}}}{2}\rho ^{\frac{-2}{2-q}} = 0.\) Thus, the optimal choice is \(\rho = \frac{\delta }{R^{2-q}}k^\frac{2-q}{2}\). Finally, fixing the number of iterations k and replacing this expression in equation (14), we get:

$$\begin{aligned} f(\hat{x}_k) - f^* \le \frac{L R^2}{2k} + \delta \frac{(2+q) R^{q}}{2k^{\frac{q}{2}}}. \end{aligned}$$

One can notice that our rate in function values is of order \(\mathcal {O}(k^{-1} + \delta k^{-\frac{q}{2}})\), while in [7] the rate is of order \(\mathcal {O}(k^{-1} + \delta )\). Hence, when \(q>0\), regardless of the accuracy of the oracle, our second term diminishes, while in [7] it remains constant. Hence, our new definition of inexact oracle of degree q, Definition 3, is also beneficial in the convex case when analysing proximal gradient type methods, i.e., large q yields better rates.

We also consider an extension of the fast inexact projected gradient method from [7], where the projection is replaced by a proximal step with respect to the function h (see [15]), called FI-PGM. Note that the inexactness in FI-PGM comes from the approximate computation of the (sub)gradient of F, denoted \(g_{\delta _k,L_k,q}(x_k)\), as given in Definition 3. Let \((\theta _k)_{k\ge 0}\) be a sequence such that:

$$\begin{aligned} \theta _0 \in (0,1],\quad \frac{\theta _{k+1}^2}{L_{k+1}}\le A_{k+1}:=\sum _{i=0}^{k+1}\frac{\theta _i}{L_i} \;\;\; \forall k\ge 0. \end{aligned}$$
(15)

Then, the fast inexact proximal gradient method (FI-PGM) is as follows:

Algorithm 3 Fast inexact proximal gradient method (FI-PGM)

Using a similar proof as in [7], we get the following convergence rate for the FI-PGM algorithm:

Corollary 3

Let F be convex and admit a \((\delta ,L)\)-oracle of degree \(q\in [0,2)\) as in Definition 3, and let \((y_k)_{k\ge 0}\) be generated by FI-PGM. Then, for all \(\rho >0\), we have the following rate:

$$\begin{aligned} f(y_{k}) - f^* \le \frac{4(L+q\rho )R^2}{(k+1)(k+2)} + \frac{(k+3)(2-q)\delta ^{\frac{2}{2-q}}}{2\rho ^{\frac{q}{2-q}}}. \end{aligned}$$
(16)

Proof

The proof follows from (13) and Theorem 4 in [7]. \(\square \)

The optimal \(\rho \) in the right hand side of inequality (16) is

$$\begin{aligned} \rho ^* = \frac{\big ((k+1)(k+2)(k+3)\big )^{\frac{2-q}{2}}}{(8R^2)^{\frac{2-q}{2}}} \delta . \end{aligned}$$

Further, replacing \(\rho \) with its optimal value in the inequality (16), we get

$$\begin{aligned} f(y_k) - f^*&\le \frac{4L R^2}{(k+1)(k+2)} + \frac{ q8^{\frac{q}{2}} R^q (k+3)}{2((k+1)(k+2) (k+3))^{\frac{q}{2}}} \delta + \frac{(2-q)8^{\frac{q}{2}} R^q(k+3)}{2((k+1)(k+2) (k+3))^{\frac{q}{2}}}\delta \\&= \frac{4L R^2}{(k+1)(k+2)} + \frac{8^{\frac{q}{2}} R^q(k+3)}{((k+1) (k+2)(k+3))^{\frac{q}{2}}} \delta = \mathcal {O}\left( \frac{LR^2}{k^2}\right) + \mathcal {O}\left( \frac{R^q}{k^{\frac{3q}{2} - 1}}\delta \right) . \end{aligned}$$

Hence, if \(q > \frac{2}{3}\), then FI-PGM doesn’t have error accumulation under our inexact oracle as the rate is of order \(\mathcal {O}\left( k^{-2} + \delta k^{1-\frac{3q}{2}}\right) \), while in [7] the FI-PGM scheme always displays error accumulation, as the convergence rate is of order \(\mathcal {O}(k^{-2} + \delta k)\). Therefore, the same conclusion holds as for I-PGM, i.e., for the FI-PGM scheme in the convex setting it is beneficial to have an inexact first-order oracle with large degree q.

Remark 7

In our Definition 2 we have considered exact zero-order information. However, it is possible to extend this definition to also allow inexact zero-order information in the nonconvex case. More precisely, we can modify Definition 2 as follows:

$$\begin{aligned} \left\{ \begin{aligned}&F_{\delta _0}(x) - F(x) \le \delta _0, \\&F(x) \!-\! \left( F_{\delta _0}(y) + \langle g_{\delta ,L,q}(y),x-y \rangle \right) \!\le \! \frac{L}{2}\Vert x-y \Vert ^{2} + \delta \Vert x - y \Vert ^q. \end{aligned}\right. \end{aligned}$$

With this new definition, the convergence result in Theorem 1 becomes:

$$\begin{aligned} \sum _{j=0}^{k}\frac{\alpha _j}{2} \Vert g_{\delta _{j},L_{j},q}(x_j) + p_{j+1} \Vert ^2\le f(x_0) - f_\infty + \frac{\sum _{j=0}^{k}(2-q)\delta _j^{\frac{2}{2-q}}}{2\rho ^{\frac{q}{2-q}}} + \sum _{j=0}^{k}\delta _0. \end{aligned}$$

Hence the rate in this case is also influenced by the inexactness of the zero-order information (i.e., \(\delta _0\)). Note that for the convex case, the previous extension is not possible in Definition 3 when \(q>0\), since we must have:

$$\begin{aligned} 0\le F(x) \!-\! \left( F_{\delta _0}(y) + \langle g_{\delta ,L,q}(y),x-y \rangle \right) \!\le \! \frac{L}{2}\Vert x-y \Vert ^{2} + \delta \Vert x - y \Vert ^q, \end{aligned}$$

which implies for \(x=y\) that \(F(x) = F_{\delta _0}(x)\). Since we want to have consistency between Definitions 2 and 3, we have chosen to work with the exact zero-order information in our previous nonconvex convergence analysis.

5 Numerical simulations

In this section, we evaluate the performance of I-PGM for a composite problem arising in image restoration. Namely, we consider the following nonconvex optimization problem [12]:

$$\begin{aligned}&\min _{x\in \mathbb {R}^n} \sum _{i=1}^{N} \text {log} \left( \left( a_i^{T}x - b_i\right) ^2 + 1\right) ,\\&\text {s.t.}\;\; \Vert x\Vert _1\le R, \end{aligned}$$
(17)

where \(R>0\), \(b\in \mathbb {R}^N\) and \(a_i\in \mathbb {R}^n\), for \(i=1:N\). In image restoration, b represents the noisy blurred image and \(A = (a_1,\cdots ,a_N) \in {\mathbb {R}^{n\times N}}\) is a blur operator [12]. This problem fits into our general problem (1), with \(F(x) = \sum _{i=1}^{N} \text {log}\left( \left( a_i^{T}x - b_i \right) ^2 + 1\right) \), which is a nonconvex function with Lipschitz continuous gradient of constant \(L_F:= \sum _{i=1}^{N}\Vert a_i\Vert ^2\), and h(x) is the indicator function of the bounded convex set \(\lbrace x:\Vert x\Vert _1\le R\rbrace \). We generate the inexact oracle by adding normally distributed random noise \(\delta \) to the true gradient, i.e., \(g_{\delta ,L,q}(x):= \nabla F(x) + \delta \). This is a particular case of Example 1. However, for all x and y satisfying \(\Vert x\Vert \le R\), \(\Vert y\Vert \le R \), we have the following:

$$\begin{aligned} \delta \Vert x - y\Vert&= \delta \Vert x - y\Vert ^{1-q}\Vert x - y\Vert ^q \le \delta (2R)^{1-q}\Vert x - y\Vert ^q. \end{aligned}$$

Thus, this example satisfies Definition 2 for all \(q\in [0, 1]\). We apply I-PGM for this particular example where we consider three choices for the degree q: 0, 1/2 and 1. Recall that the convergence rate of I-PGM with constant step size is (see Corollary 1, first statement):

$$\begin{aligned}&\min _{j=0:k} \Vert g_j \!+\! p_{j+1}\Vert ^2 \le \frac{2(q\!+\!1)L(f(x_0) \!-\! f^*)}{k+1} +(q\!+\!1)(2\!-\!q)L^{\frac{2-2q}{2-q}}\delta ^{\frac{2}{2-q}}. \end{aligned}$$
(18)

At each iteration of I-PGM we need to solve the following convex subproblem:

$$\begin{aligned}&\min _{x\in \mathbb {R}^n} F(x_k) + \langle g_{\delta ,q}(x_k),x-x_k\rangle + \frac{L + q\rho }{2}\Vert x - x_k\Vert ^2,\;\; \text {s.t.} \;\; \Vert x\Vert _1 \le R. \end{aligned}$$

This subproblem has a closed form solution (see e.g., [19]). We compare I-PGM with constant step size \(\alpha _k = \frac{1}{2(L_F+q\rho )}\) and \(\rho = L_F\) for three choices of \(q=0, 1/2, 1\) and three choices of noise norm \(\Vert \delta \Vert \le 0.1, 1, 3\), respectively. The results are given in Fig. 1 (dotted lines), where we plot the evolution of the error \(\min _{j=0:k}\Vert \frac{1}{\alpha _k}(x_{j+1} - x_j)\Vert ^2\), which corresponds to the gradient mapping. In the same figure we also plot the theoretical bounds (18) for \(q=0, 1/2, 1\) (full lines). Our main figures are Fig. 1a, c, d, while Fig. 1b is a subfigure (zoom) of Fig. 1a, displaying only the first 300 iterations. Moreover, one can see in these main figures (i.e., Fig. 1a, c, d) that the behaviour of our algorithm for \(q=1\) is better than for \(q=1/2\). Similarly, the behaviour of our algorithm for \(q=1/2\) is better than for \(q=0\). One can observe these better behaviours after 300 iterations when the error \(\delta \) is small (see Fig. 1c, d). However, when the error \(\delta \) is large, we need to perform a larger number of iterations before we can observe these behaviours (see Fig. 1a, b). This is natural, since large errors on the gradient approximation must have an impact on the convergence speed. Hence, as the degree q increases or the norm of the noise decreases, better accuracies for the norm of the gradient mapping can be achieved, which supports our theoretical findings.
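For reproducibility, a minimal Python sketch of the setup described above is given below; random data are used in place of the blur operator and the noisy image, and the noise is rescaled to a fixed norm, which are our simplifying assumptions. It builds the noisy gradient oracle, runs I-PGM with the projection onto the \(\ell _1\)-ball as prox step, and tracks the gradient mapping:

```python
import numpy as np

rng = np.random.default_rng(3)
N, n, R = 200, 100, 4.0
A = rng.standard_normal((N, n))                  # rows a_i^T (stand-in for the blur operator)
b = rng.standard_normal(N)
L_F = np.sum(np.linalg.norm(A, axis=1) ** 2)     # Lipschitz constant used in the text

def grad_F(x):
    r = A @ x - b
    return A.T @ (2 * r / (r ** 2 + 1))          # gradient of sum_i log((a_i^T x - b_i)^2 + 1)

def noisy_grad(x, noise_norm):
    d = rng.standard_normal(n)
    return grad_F(x) + noise_norm * d / np.linalg.norm(d)

def project_l1_ball(v, R):
    """Euclidean projection onto {x : ||x||_1 <= R} (sort-based algorithm)."""
    if np.abs(v).sum() <= R:
        return v
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    idx = np.nonzero(u - (css - R) / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = (css[idx] - R) / (idx + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

q, rho, noise_norm = 1, L_F, 1.0
alpha = 1.0 / (2 * (L_F + q * rho))              # constant step size used in the experiments
x, best_grad_map = np.zeros(n), np.inf
for k in range(500):
    x_next = project_l1_ball(x - alpha * noisy_grad(x, noise_norm), R)
    best_grad_map = min(best_grad_map, (np.linalg.norm(x_next - x) / alpha) ** 2)
    x = x_next
print(f"min_j ||(x_(j+1) - x_j)/alpha||^2 after 500 iterations: {best_grad_map:.3e}")
```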

Moreover, from the numerical simulations, one can observe that the gap between the theoretical and the practical bounds is large in Fig. 1c, d. We believe that this happens because, in the convergence analysis, the theoretical bounds are derived under worst-case scenarios (i.e., the convergence analysis must account for the worst-case direction generated by the inexact first-order oracle, while in practical implementations, which often involve randomness, one usually doesn’t encounter these worst-case directions). However, the simulations in Fig. 1a show that the gap between the theoretical bounds and the practical behavior is not too large. More precisely, we have generated at each iteration 100 random directions and, in order to update the new point, we have chosen the worst direction with respect to the gradient mapping (i.e., the one yielding the largest \(\Vert x_{k+1} - x_k\Vert \)). The results are given in Fig. 1a, where one can see that the theoretical and practical bounds get closer for a sufficiently large number of iterations.

Fig. 1

Practical (dotted lines) and theoretical (full lines) performance of the I-PGM algorithm for different choices of q and \(\delta \), and with \(R = 4\). Figure (b) is a zoom of the left corner of Figure (a)