1 Introduction

We consider the gradient method for the unconstrained optimization problem

$$\begin{aligned} f^\star :=\inf _{x\in {\mathbb {R}}^n} f(x), \end{aligned}$$
(1)

where \(f: {\mathbb {R}}^n\rightarrow {\mathbb {R}}\) is differentiable, and \(f^\star \) is finite. The gradient method with fixed step lengths may be described as follows.

Algorithm 1 (Gradient method with fixed step lengths). Given a starting point \(x^1\in {\mathbb {R}}^n\) and step lengths \(t_1, \dots , t_N>0\), set \(x^{k+1}=x^k-t_k\nabla f(x^k)\) for \(k=1, \dots , N\).
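For concreteness, a minimal Python sketch of Algorithm 1 (the quadratic example in the last two lines is purely illustrative and not taken from the paper):

```python
import numpy as np

def gradient_method(grad_f, x1, steps):
    """Algorithm 1 (sketch): gradient method with prescribed fixed step lengths t_1,...,t_N."""
    x = np.asarray(x1, dtype=float)
    for t in steps:
        x = x - t * grad_f(x)          # x^{k+1} = x^k - t_k * grad f(x^k)
    return x

# Illustrative run on f(x) = 0.5*||x||^2 (so grad f(x) = x and L = 1) with t_k = 1/L.
x_final = gradient_method(lambda x: x, np.array([1.0, -2.0]), [1.0, 1.0, 1.0])
```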

In addition, we assume that f has a maximum curvature \(L\in (0, \infty )\) and a minimum curvature \(\mu \in (-\infty , L)\). Recall that f has a maximum curvature L if \(\tfrac{L}{2}\Vert \cdot \Vert ^2-f\) is convex. Similarly, f has a minimum curvature \(\mu \) if \(f-\tfrac{\mu }{2}\Vert \cdot \Vert ^2\) is convex. We denote the class of smooth functions with curvature belonging to the interval \([\mu , L]\) by \({\mathcal {F}}_{\mu , L}({\mathbb {R}}^n)\). The class \({\mathcal {F}}_{\mu , L}({\mathbb {R}}^n)\) includes all smooth functions with Lipschitz gradient (note that \(\mu \ge 0\) corresponds to convexity). Indeed, f is L-smooth on \({\mathbb {R}}^n\) if and only if f has a maximum curvature \(\bar{L}>0\) and a minimum curvature \(\bar{\mu }\) with \(\max (\bar{L}, |\bar{\mu }|)\le L\). This class of functions is broad and appears naturally in many models in machine learning; see [8] and the references therein.

For \(f\in {\mathcal {F}}_{\mu , L}({\mathbb {R}}^n)\), we have the following inequalities for \(x, y\in {\mathbb {R}}^n\)

$$\begin{aligned} f(y)&\le f(x)+\langle \nabla f(x), y-x\rangle +\tfrac{L}{2}\Vert y-x\Vert ^2, \end{aligned}$$
(2)
$$\begin{aligned} f(y)&\ge f(x)+\langle \nabla f(x), y-x\rangle +\tfrac{\mu }{2}\Vert y-x\Vert ^2; \end{aligned}$$
(3)

see Lemma 2.5 in [21].

It is known that the complexity lower bound for first order methods to obtain an \(\epsilon \)-stationary point of an L-smooth function is of the order \(\Omega \left( \epsilon ^{-2}\right) \) [6]. Hence, it is of interest to investigate classes of functions for which the gradient method enjoys a linear convergence rate. This subject has been investigated by several authors, and some classes of functions have been identified for which linear convergence is possible; see [7, 14,15,16] and the references therein. This includes the class of functions satisfying the Polyak-Łojasiewicz (PŁ) inequality [16, 20].

Definition 1

A function f is said to satisfy the PŁ inequality on \(X\subseteq {\mathbb {R}}^n\) if there exists \(\mu _p>0\) such that

$$\begin{aligned} f(x)-f^\star \le \tfrac{1}{2\mu _p} \Vert \nabla f(x)\Vert ^2, \ \ \ \forall x\in X. \end{aligned}$$
(4)

Note that the PŁ inequality is also known as gradient dominance; see [19, Definition 4.1.3]. Strongly convex functions satisfy the PŁ inequality, but some classes of non-convex functions fulfill it as well. For instance, consider a differentiable function \(G:{\mathbb {R}}^n\rightarrow {\mathbb {R}}^m\) with \(m\le n\). Suppose that the non-linear system \(G(x)=0\) has a solution. If

$$\begin{aligned} \min _{x\in {\mathbb {R}}^n}\lambda _{\min }\left( J_G(x) J_G(x)^T\right) =\sigma >0, \end{aligned}$$

where \(J_G(x)\) is the Jacobian matrix of G at x, then the function \(f(x)=\Vert G(x)\Vert ^2\) fulfills the PŁ inequality; see [19, Example 4.1.3]. Here, \(\lambda _{\min }(A)\) denotes the smallest eigenvalue of a symmetric matrix A. In other words, nonlinear least squares problems often correspond to instances of (1) where the objective satisfies the PŁ inequality.
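A quick numerical illustration of this fact (a sketch: the map G below is an arbitrary choice, not taken from [19], and \(\sigma \) is only estimated over a finite sample):

```python
import numpy as np

def G(x):       # arbitrary illustrative map with a uniformly well-conditioned Jacobian
    return np.array([x[0] + 0.5*np.sin(x[1]), x[1]])

def J(x):       # Jacobian of G
    return np.array([[1.0, 0.5*np.cos(x[1])], [0.0, 1.0]])

def f(x):       # f(x) = ||G(x)||^2, with f* = 0 attained at x = 0
    return np.dot(G(x), G(x))

def grad_f(x):  # grad f(x) = 2 J_G(x)^T G(x)
    return 2.0 * J(x).T @ G(x)

rng = np.random.default_rng(0)
pts = 3.0 * rng.normal(size=(1000, 2))
sigma = min(np.linalg.eigvalsh(J(x) @ J(x).T)[0] for x in pts)   # sample estimate of sigma
mu_p = 2.0 * sigma   # since ||grad f||^2 = 4 G^T J J^T G >= 4*sigma*||G||^2 = 4*sigma*f
assert all(f(x) <= np.dot(grad_f(x), grad_f(x)) / (2.0*mu_p) + 1e-9 for x in pts)
```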

The following classical theorem provides a linear convergence rate for Algorithm 1 under the PŁ inequality.

Theorem 1

[20, Theorem 4] Let f be L-smooth and let f satisfy the PŁ inequality on \(X=\{x: f(x)\le f(x^1)\}\). If \(t_1\in (0, \tfrac{2}{L})\) and \(x^2\) is generated by Algorithm 1, then

$$\begin{aligned} f(x^2)-f^\star \le \left( 1-t_1\mu _p(2-t_1L)\right) \left( f(x^1)-f^\star \right) . \end{aligned}$$
(5)

In particular, if \(t_1=\tfrac{1}{L}\), we have

$$\begin{aligned} f(x^2)-f^\star \le \left( 1-\tfrac{\mu _p}{L}\right) \left( f(x^1)-f^\star \right) . \end{aligned}$$
(6)

In this paper we will sharpen this bound; see Theorem 3. Under the assumptions of Theorem 1, Karimi et al. [16] established linear convergence rates for some other methods, including randomized coordinate descent. We refer the interested reader to the recent survey [7] for more details on the convergence of non-convex algorithms under the PŁ inequality.

In this paper, we analyse the gradient method from a black-box perspective, which means that we only have access to the gradient and the function value at given points. Furthermore, we study the convergence rate of Algorithm 1 by using performance estimation.

In recent years, performance estimation has been used to find worst-case convergence rates of first order methods [1, 2, 9, 10, 13, 23], to name but a few. This powerful tool was first introduced by Drori and Teboulle in their seminal paper [12]. The idea of performance estimation is that the infinite-dimensional optimization problem of computing a worst-case convergence rate may be reformulated as a finite-dimensional optimization problem (often a semidefinite program) by using interpolation theorems. The interested reader may consult the PhD theses of Drori [11] and Taylor [22] for an introduction to, and review of, the topic.

The rest of the paper is organized as follows. In Sect. 2, we consider problem (1) when f satisfies the PŁ inequality. We derive a new linear convergence rate for Algorithm 1 by using performance estimation. Furthermore, we provide an optimal step length with respect to the given bound. We also show that the PŁ inequality is necessary and sufficient for linear convergence, in a well-defined sense. Sect. 3 lists some other situations where Algorithm 1 is linearly convergent. Moreover, we study the relationships between these situations. Finally, we conclude the paper with some remarks and questions for future research.

Notation

The n-dimensional Euclidean space is denoted by \({\mathbb {R}}^n\). Vectors are considered to be column vectors and the superscript T denotes the transpose operation. We use \(\langle \cdot , \cdot \rangle \) and \(\Vert \cdot \Vert \) to denote the Euclidean inner product and norm, respectively. For a matrix A, \(A_{ij}\) denotes its (i, j)-th entry. The notation \(A\succeq 0\) means the matrix A is symmetric positive semi-definite, and \(\mathrm{tr}(A)\) stands for the trace of A.

2 Linear convergence under the PŁ inequality

This section studies linear convergence of the gradient descent under the PŁ inequality. It is readily seen that the PŁ inequality implies that every stationary point is a global minimum on X. By virtue of the descent lemma [19, Page 29], we have

$$\begin{aligned} f(x)-f^\star \ge \tfrac{1}{2L}\Vert \nabla f(x)\Vert ^2, \ \forall x\in {\mathbb {R}}^n. \end{aligned}$$

Hence, \(\mu _p\) can take values in (0, L]. On the other hand, we may assume without loss of generality that \(\mu \le \mu _p\). This is immediate if \(\mu \le 0\), so suppose that \(\mu >0\). By taking the minimum with respect to y on both sides of inequality (3), we get

$$\begin{aligned} f(x)-f^\star \le \tfrac{1}{2\mu } \Vert \nabla f(x)\Vert ^2. \end{aligned}$$

Hence, one may assume without loss of generality \(\mu _p=\max \{\mu , \mu _p\}\) in inequality (4).

In what follows, we employ performance estimation to get a new bound under the assumptions of Theorem 1. In this setting, the worst-case convergence rate of Algorithm 1 may be cast as the following optimization problem,

$$\begin{aligned} \begin{array}{ll} \max &{} \frac{f(x^2)-f^\star }{f(x^1)-f^\star }\\ &{} x^2 \ \text {is generated by Algorithm 1 w.r.t.}\ f, x^1 \\ &{} f(x)\ge f^\star \ \forall x\in {\mathbb {R}}^n\\ &{} f(x)-f^\star \le \tfrac{1}{2\mu _p} \Vert \nabla f(x)\Vert ^2, \ \ \ \forall x\in X\\ &{} f\in {\mathcal {F}}_{\mu , L}({\mathbb {R}}^n)\\ &{} x^1\in {\mathbb {R}}^n. \end{array} \end{aligned}$$
(7)

In problem (7), f and \(x^1\) are decision variables and \(X=\{x: f(x)\le f(x^1)\}\). We may replace the infinite dimensional condition \(f \in {\mathcal {F}}_{\mu ,L}({\mathbb {R}}^n)\) by a finite set of constraints, by using interpolation. Theorem 2 gives some necessary and sufficient conditions for the interpolation of given data by some \(f\in {\mathcal {F}}_{\mu , L}({\mathbb {R}}^n)\).

Theorem 2

[21, Theorem 3.1] Let \(\{(x^i; g^i; f^i)\}_{i\in I}\subseteq {\mathbb {R}}^n \times {\mathbb {R}}^n \times {\mathbb {R}}\) with a given index set I and let \(L\in (0, \infty ]\) and \(\mu \in (-\infty , L)\). There exists a function \(f\in {\mathcal {F}}_{\mu , L}({\mathbb {R}}^n)\) with

$$\begin{aligned} f(x^i) = f^i, \nabla f(x^i) = g^i \ \ i\in I, \end{aligned}$$
(8)

if and only if for every \(i, j\in I\)

$$\begin{aligned}&\tfrac{1}{2\left( 1-\tfrac{\mu }{L}\right) }\left( \tfrac{1}{L} \left\| g^i-g^j\right\| ^2+\mu \left\| x^i-x^j\right\| ^2-\tfrac{2\mu }{L} \left\langle g^j-g^i,x^j-x^i\right\rangle \right) \nonumber \\&\quad \le f^i-f^j-\left\langle g^j, x^i-x^j\right\rangle . \end{aligned}$$
(9)

It is worth noting that Theorem 2 also covers non-smooth functions; indeed, the case \(L=\infty \) corresponds to non-smooth functions. Note that we only investigate the smooth case in this paper, that is, \(L\in (0, \infty )\) and \(\mu \in (-\infty , 0]\).

By Theorem 2, problem (7) may be relaxed as follows,

$$\begin{aligned} \begin{array}{ll} \max &{} \frac{f^2-f^\star }{f^1-f^\star }\\ \mathrm{s.t}. &{} \tfrac{1}{2(1-\tfrac{\mu }{L})} \left( \tfrac{1}{L}\left\| g^i-g^j\right\| ^2 +\mu \left\| x^i-x^j\right\| ^2-\tfrac{2\mu }{L} \left\langle g^j-g^i,x^j-x^i\right\rangle \right) \le \\ &{} f^i-f^j-\left\langle g^j, x^i-x^j\right\rangle \ \ i, j\in \left\{ 1, 2\right\} \\ &{} x^{2}=x^1-t_1 g^1 \\ &{} f^k\ge f^\star \ \ k\in \{1, 2\}\\ &{} f^k-f^\star \le \tfrac{1}{2\mu _p}\Vert g^k\Vert ^2, \ \ k\in \{1, 2\}. \end{array} \end{aligned}$$
(10)

Since the constraint \( f(x)-f^\star \le \tfrac{1}{2\mu _p} \Vert \nabla f(x)\Vert ^2\), imposed for each \(x\in X\), is replaced by only the two inequalities \( f^1-f^\star \le \tfrac{1}{2\mu _p}\Vert g^1\Vert ^2\) and \( f^2-f^\star \le \tfrac{1}{2\mu _p}\Vert g^2\Vert ^2\), problem (10) is a relaxation of problem (7). By using the constraint \(x^2=x^1-t_1g^1\), problem (10) may be reformulated as,

$$\begin{aligned} \begin{array}{ll} \max &{} \frac{f^2-f^\star }{f^1-f^\star }\\ \mathrm{s.t}. &{} \frac{1}{2(L-\mu )}\left( \Vert g^2\Vert ^2 + (1+\mu Lt_1^2-2\mu t_1)\Vert g^1\Vert ^2+2(\mu t_1-1) \langle g^1,g^2\rangle \right) \\ &{} -f^2 + f^1 - \left\langle g^1,t_1g^1\right\rangle \le 0\\ &{} \frac{1}{2(L-\mu )}\left( \Vert g^2\Vert ^2+ (1+\mu Lt_1^2 -2\mu t_1)\Vert g^1\Vert ^2+2(\mu t_1-1)\langle g^1,g^2\rangle \right) \\ &{} -f^1 + f^2 + \left\langle g^2,t_1g^1\right\rangle \le 0 \\ &{} f^\star -f^k\le 0 \ \ k\in \left\{ 1, 2\right\} \\ &{} f^k-f^\star -\tfrac{1}{2\mu _p}\Vert g^k\Vert ^2\le 0, \ \ k\in \left\{ 1, 2\right\} . \end{array} \end{aligned}$$
(11)

By using the Gram matrix,

$$\begin{aligned} X=\begin{pmatrix} (g^1)^T \\ (g^2)^T \end{pmatrix} \begin{pmatrix} g^1&g^2 \end{pmatrix} =\begin{pmatrix} \Vert g^1\Vert ^2 &{} \langle g^1,g^2\rangle \\ \langle g^1,g^2\rangle &{} \Vert g^2\Vert ^2 \end{pmatrix}, \end{aligned}$$

problem (11) can be relaxed as follows,

$$\begin{aligned} \begin{array}{ll} \max &{} \frac{f^2-f^\star }{f^1-f^\star }\\ \mathrm{s.t}. &{} \mathrm{tr}(A_1X)-f^2 + f^1 \le 0\\ &{} \mathrm{tr}(A_2X)-f^1 + f^2\le 0 \\ &{} f^1-f^\star +\mathrm{tr}(A_3X)\le 0 \\ &{} f^2-f^\star +\mathrm{tr}(A_4X)\le 0\\ &{} f^1,f^2\ge f^\star , X\succeq 0, \end{array} \end{aligned}$$
(12)

where

$$\begin{aligned} A_1&=\begin{pmatrix} \frac{1+\mu Lt_1^2-2\mu t_1}{2(L-\mu )}-t_1 &{} \frac{\mu t_1-1}{2(L-\mu )} \\ \frac{\mu t_1-1}{2(L-\mu )} &{} \frac{1}{2(L-\mu )} \end{pmatrix} \quad A_2=\begin{pmatrix} \frac{1+\mu Lt_1^2-2\mu t_1}{2(L-\mu )} &{} \frac{\mu t_1-1}{2(L-\mu )}+\frac{t_1}{2} \\ \frac{\mu t_1-1}{2(L-\mu )}+\frac{t_1}{2} &{} \frac{1}{2(L-\mu )} \end{pmatrix} \\ A_3&=\begin{pmatrix} \frac{-1}{2\mu _p} &{} 0 \\ 0 &{} 0 \end{pmatrix} \quad A_4=\begin{pmatrix} 0 &{} 0 \\ 0 &{} \frac{-1}{2\mu _p} \end{pmatrix}. \end{aligned}$$

In addition, \(X, f^1, f^2\) are decision variables in this formulation. In the next theorem, we obtain an upper bound for problem (11) by using weak duality. This bound gives a new convergence rate for Algorithm 1 for a wide variety of functions.
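Before stating the theorem, we remark that (12) is a small SDP which can also be solved numerically. The following CVXPY sketch (with arbitrarily chosen values of L, \(\mu \), \(\mu _p\) and \(t_1=1/L\)) only illustrates the formulation; by weak duality, its optimal value should not exceed the closed-form bound of Theorem 3.

```python
import numpy as np
import cvxpy as cp

L, mu, mu_p = 1.0, -1.0, 0.1          # illustrative values (mu = -L)
t1 = 1.0 / L
c = 1.0 / (2.0 * (L - mu))

A1 = np.array([[(1 + mu*L*t1**2 - 2*mu*t1)*c - t1, (mu*t1 - 1)*c],
               [(mu*t1 - 1)*c,                     c]])
A2 = np.array([[(1 + mu*L*t1**2 - 2*mu*t1)*c, (mu*t1 - 1)*c + t1/2],
               [(mu*t1 - 1)*c + t1/2,         c]])
A3 = np.diag([-1.0/(2*mu_p), 0.0])
A4 = np.diag([0.0, -1.0/(2*mu_p)])

# Normalize f* = 0 and f^1 = 1 (the objective of (12) is invariant under this scaling).
X = cp.Variable((2, 2), PSD=True)     # Gram matrix of (g^1, g^2)
f2 = cp.Variable()
f1 = 1.0

constraints = [cp.trace(A1 @ X) - f2 + f1 <= 0,
               cp.trace(A2 @ X) - f1 + f2 <= 0,
               f1 + cp.trace(A3 @ X) <= 0,
               f2 + cp.trace(A4 @ X) <= 0,
               f2 >= 0]
prob = cp.Problem(cp.Maximize(f2), constraints)
prob.solve()

rate_sdp = prob.value
rate_thm3 = (2*L - 2*mu_p) / (2*L + mu_p)   # bound (13) for t1 = 1/L and mu = -L
print(rate_sdp, rate_thm3)
```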

Theorem 3

Let \(f\in {\mathcal {F}}_{\mu , L}({\mathbb {R}}^n)\) with \(L\in (0, \infty ), \mu \in (-\infty , 0]\) and let f satisfy the PŁ inequality on \(X=\{x: f(x)\le f(x^1)\}\). Suppose that \(x^2\) is generated by Algorithm 1.

  1. i)

    If \(t_1\in \left( 0,\tfrac{1}{L}\right) \), then

    $$\begin{aligned}&\frac{f(x^2)-f^\star }{f(x^1)-f^\star }\\&\quad \le \left( \frac{\mu _p\left( 1-Lt_1\right) +\sqrt{\left( L-\mu \right) \left( \mu -\mu _p \right) \left( 2-L t_1\right) \mu _pt_1+\left( L-\mu \right) ^2}}{L-\mu +\mu _p}\right) ^2. \end{aligned}$$
  2. ii)

    If \(t_1\in \left[ \tfrac{1}{L}, \tfrac{3}{\mu +L+\sqrt{\mu ^2-L\mu +L^2}}\right] \), then

    $$\begin{aligned} \frac{f(x^2)-f^\star }{f(x^1)-f^\star } \le \left( \frac{(Lt_1-2)(\mu t_1-2)\mu _p t_1}{\left( L+\mu -\mu _p\right) t_1-2}+1\right) . \end{aligned}$$
  3. iii)

    If \(t_1\in \left( \tfrac{3}{\mu +L+\sqrt{\mu ^2-L\mu +L^2}}, \tfrac{2}{L}\right) \), then

    $$\begin{aligned} \frac{f(x^2)-f^\star }{f(x^1)-f^\star }\le \frac{(L t_1-1)^2}{(Lt_1-1)^2 +\mu _p t_1(2-Lt_1)}. \end{aligned}$$

In particular, if \(t_1=\tfrac{1}{L}\) and \(\mu =-L\), we have

$$\begin{aligned} f(x^2)-f^\star \le \left( \frac{2L-2\mu _p}{2L+\mu _p}\right) \left( f(x^1)-f^\star \right) . \end{aligned}$$
(13)

Proof

First we consider \(t_1\in \left( 0,\tfrac{1}{L}\right) \). Let

$$\begin{aligned}&b_1=\frac{\left( L-\mu \right) \left( \alpha +\mu _p \left( 1-L t_1\right) \right) }{\alpha \left( L-\mu +\mu _p\right) }\\&b_2=b_1-\left( \frac{\alpha }{L-\mu }b_1\right) ^2, \end{aligned}$$

where

$$\begin{aligned} \alpha =\sqrt{\left( L-\mu \right) \left( \mu _pt_1\left( \mu _p-\mu \right) \left( Lt_1-2\right) +\left( L-\mu \right) \right) }. \end{aligned}$$

It is readily seen that \(b_1,b_2\ge 0\). Furthermore,

$$\begin{aligned}&{f^2-f^\star }-(b_1-b_2)\left( {f^1-f^\star } \right) -b_2\left( -\frac{1}{2\mu _p}\left\| g^1\right\| ^2+f^1-f^\star \right) \\&\quad -\left( 1-b_1\right) \left( -\frac{1}{2\mu _p}\left\| g^2\right\| ^2 +f^2-f^\star \right) -b_1\left( \frac{1}{2(L-\mu )}\left( \Vert g^2\Vert ^2\right. \right. \\&\quad \left. \left. +(1+\mu Lt_1^2-2\mu t_1)\Vert g^1\Vert ^2+2(\mu t_1-1)\langle g^1,g^2\rangle \right) -f^1 + f^2 + \left\langle g^2,t_1g^1\right\rangle \right) \\&\quad =-\frac{1-Lt_1}{2\alpha } \left\| \frac{\alpha b_1}{L-\mu }g^1-g^2\right\| ^2\le 0. \end{aligned}$$

Therefore, for any feasible solution of problem (11), we have

$$\begin{aligned}&\frac{f(x^2)-f^\star }{f(x^1)-f^\star }\\&\quad \le \left( \frac{\mu _p\left( 1-Lt_1\right) +\sqrt{\left( L-\mu \right) \left( \mu -\mu _p \right) \left( 2-L t_1\right) \mu _pt_1+\left( L-\mu \right) ^2}}{L-\mu +\mu _p}\right) ^2, \end{aligned}$$

and the proof of this part is complete. Now, we consider the case that \(t_1\in \left[ \tfrac{1}{L}, \tfrac{3}{\mu +L+\sqrt{\mu ^2 -L\mu +L^2}}\right] \). Suppose that

$$\begin{aligned}&a_1=\frac{\mu t_1-1}{\left( L+\mu -\mu _p\right) t_1-2}, \quad a_2=\frac{1-L t_1}{\left( L+\mu -\mu _p\right) t_1-2}, \\&a_3=-\frac{\left( (Lt_1-2)(\mu t_1-2)-1\right) \mu _pt_1}{\left( L+\mu -\mu _p\right) t_1-2}, \quad a_4=-\frac{\mu _p t_1}{\left( L+\mu -\mu _p\right) t_1-2}. \end{aligned}$$

It is readily seen that \(a_1,a_2,a_3,a_4\ge 0\). Furthermore,

$$\begin{aligned}&{f^2-f^\star }-\left( 1-a_3-a_4\right) \left( {f^1-f^\star } \right) -a_3\left( -\frac{1}{2\mu _p}\left\| g^1\right\| ^2+f^1-f^\star \right) \\&\quad - a_4\left( -\frac{1}{2\mu _p}\left\| g^2\right\| ^2+f^2-f^\star \right) -a_1\left( \frac{1}{2(L-\mu )}\left( \Vert g^2\Vert ^2+(1+\mu Lt_1^2-2\mu t_1)\Vert g^1\Vert ^2\right. \right. \\&\quad \left. \left. +2(\mu t_1-1)\langle g^1,g^2\rangle \right) -f^1 + f^2 +\left\langle g^2,t_1g^1\right\rangle \right) -a_2 \left( \frac{1}{2(L-\mu )}\left( \Vert g^2\Vert ^2\right. \right. \\&\quad \left. \left. + (1+\mu Lt_1^2-2\mu t_1)\Vert g^1\Vert ^2+2(\mu t_1-1) \langle g^1,g^2\rangle \right) -f^2 + f^1 -\left\langle g^1,t_1g^1\right\rangle \right) =0. \end{aligned}$$

Therefore, for any feasible solution of problem (11), we have

$$\begin{aligned} f(x^2)-f^\star - \left( \frac{L\mu _p\mu t_1^3-2\mu _p\left( L+\mu \right) t_1^2+4\mu _p t_1}{\left( L+\mu -\mu _p\right) t_1-2}+1 \right) \left( f(x^1)-f^\star \right) \le 0. \end{aligned}$$

Now, we prove the last part. Assume that \(t_1\in \left( \tfrac{3}{\mu +L+\sqrt{\mu ^2-L\mu +L^2}},\tfrac{2}{L}\right) \). With some algebra, one can show

$$\begin{aligned}&{f^2-f^\star }-\left( \frac{(L t_1-1)^2}{\beta }\right) \left( {f^1-f^\star } \right) -\left( \frac{\mu _p t_1(2-Lt_1)}{\beta }\right) \left( -\frac{1}{2\mu _p} \left\| g^2\right\| ^2+f^2-f^\star \right) \\&\qquad -\left( \frac{(L t_1-1)(2-Lt_1)}{\beta }\right) \\&\qquad \left( \frac{1}{2(L-\mu )} \left( \Vert g^2\Vert ^2+(1+\mu Lt_1^2-2\mu t_1)\Vert g^1\Vert ^2+2(\mu t_1-1) \langle g^1,g^2\rangle \right) \right. \\&\qquad \left. - f^2 + f^1 - \left\langle g^1,t_1g^1\right\rangle \right) -\left( \frac{L t_1-1}{\beta } \right) \left( \frac{1}{2(L-\mu )} \left( \Vert g^2\Vert ^2+ (1+\mu Lt_1^2-2\mu t_1)\Vert g^1\Vert ^2\right. \right. \\&\qquad \left. \left. + 2(\mu t_1-1)\langle g^1,g^2\rangle \right) -f^1 + f^2 + \left\langle g^2,t_1g^1\right\rangle \right) \\&\quad = -\frac{(1-L t_1) \left( L\mu t_1^2-2(\mu +L)t_1+3\right) }{2 \beta (L-\mu )} \left\| \sqrt{Lt_1-1}g^1+\tfrac{1}{\sqrt{Lt_1-1}} g^2\right\| ^2\le 0, \end{aligned}$$

where,

$$\begin{aligned} \beta =(Lt_1-1)^2+\mu _p t_1(2-Lt_1). \end{aligned}$$

The rest of the proof is similar to that of the former cases. \(\square \)

One may wonder how we obtained the Lagrange multipliers (dual variables) in Theorem 3. The multipliers were computed by solving the dual of problem (12) by hand. Furthermore, Theorem 3 provides a tighter bound than the convergence rate given in Theorem 1 for L-smooth functions with \(t_1\in (0, \tfrac{2}{L})\). To show this (taking \(\mu =-L\)), we need to investigate three subintervals:

  1. i)

    Suppose that \(t_1\in \left( 0,\tfrac{1}{L}\right) \). As \(1-Lt_1\le 0\),

    $$\begin{aligned}&\left( \tfrac{\mu _p\left( 1-Lt_1\right) +\sqrt{2L\left( -L-\mu _p \right) \left( 2-L t_1\right) \mu _pt_1+4L^2}}{2L+\mu _p}\right) ^2\\&\quad \le \tfrac{4 L^2+2 L \mu _p t_1 (L+\mu _p) (L t_1-2) +(\mu _p-L \mu _p t_1)^2}{(2 L+\mu _p)^2}\le 1-t_1\mu _p(2-t_1L), \end{aligned}$$

    where the last inequality follows from non-positivity of the quadratic function \(T_1(t_1)=-L t_1^2 \left( 2 L^2+L \mu _p+\mu _p^2\right) +2 t_1 \left( 2 L^2+L \mu _p+\mu _p^2\right) -4 L\) on the given interval.

  2. ii)

    Let \(t_1\in \left[ \tfrac{1}{L}, \tfrac{\sqrt{3}}{L}\right] \). Since \(\mu _p\le L\) and \((2-Lt_1)> 0\), we have

    $$\begin{aligned} 1\le \tfrac{L t_1+2}{\mu _pt_1+2} \ \Rightarrow \ 1-\tfrac{(2-Lt_1)(L t_1+2) \mu _p t_1}{\mu _pt_1+2} \le 1-t_1\mu _p(2-Lt_1). \end{aligned}$$
  3. iii)

    Assume that \(t_1\in (\tfrac{\sqrt{3}}{L},\tfrac{2}{L})\). It is readily verified that the quadratic function \(T_2(t_1)=(Lt_1-1)^2+\mu _p t_1(2-Lt_1)-1\) is non-positive on the given interval. Hence,

    $$\begin{aligned} \tfrac{(L t_1-1)^2}{(Lt_1-1)^2+\mu _p t_1(2-Lt_1)} =1-\tfrac{\mu _p t_1(2-Lt_1)}{(Lt_1-1)^2+\mu _p t_1(2-Lt_1)} \le 1- t_1\mu _p(2-Lt_1). \end{aligned}$$

Therefore, for \(t_1\in \left( 0,\tfrac{2}{L}\right) \) the bound provided by Theorem 3 is tighter than that given by Theorem 1.
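A quick numerical spot check of this comparison (a sketch; the values of L and \(\mu _p\) and the step-length grid below are arbitrary, and \(\mu \) is set to \(-L\) as in the argument above):

```python
import numpy as np

def rate_thm3(t, L, mu, mu_p):
    """Piecewise worst-case rate of Theorem 3."""
    t_mid = 3.0 / (mu + L + np.sqrt(mu**2 - L*mu + L**2))
    if t < 1.0 / L:
        num = mu_p*(1 - L*t) + np.sqrt((L - mu)*(mu - mu_p)*(2 - L*t)*mu_p*t + (L - mu)**2)
        return (num / (L - mu + mu_p))**2
    if t <= t_mid:
        return (L*t - 2)*(mu*t - 2)*mu_p*t / ((L + mu - mu_p)*t - 2) + 1
    return (L*t - 1)**2 / ((L*t - 1)**2 + mu_p*t*(2 - L*t))

def rate_thm1(t, L, mu_p):
    """Rate (5) of Theorem 1."""
    return 1 - t*mu_p*(2 - t*L)

L, mu_p = 1.0, 0.1
for t in np.linspace(0.05, 1.95, 39):
    assert rate_thm3(t, L, -L, mu_p) <= rate_thm1(t, L, mu_p) + 1e-12
```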

In most problems, the smoothness constant, L, is unknown. By using (2), any estimation of the smoothness constant L, say \(\tilde{L}\), should satisfy the following inequality,

$$\begin{aligned} f\left( x-\tfrac{1}{\tilde{L}} \nabla f(x)\right) \le f(x) -\tfrac{1}{2\tilde{L}} \Vert \nabla f(x)\Vert ^2. \end{aligned}$$

Thus one may try to obtain a suitable estimate by searching for a sufficiently large value of \(\tilde{L}\) that satisfies this inequality. This technique is due to Nesterov; see [18, Section 3] for details.
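A minimal sketch of this estimation scheme (the initial guess, growth factor and iteration cap below are arbitrary choices, not prescribed in [18]):

```python
import numpy as np

def estimate_L(f, grad_f, x, L0=1.0, growth=2.0, max_iter=60):
    """Increase the estimate until the sufficient-decrease test above is satisfied."""
    g = grad_f(x)
    L_tilde = L0
    for _ in range(max_iter):
        if f(x - g / L_tilde) <= f(x) - np.dot(g, g) / (2.0 * L_tilde):
            return L_tilde
        L_tilde *= growth
    return L_tilde
```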

The next proposition gives the optimal step length with respect to the worst-case convergence rate.

Proposition 1

Let \(f\in {\mathcal {F}}_{\mu , L}({\mathbb {R}}^n)\) with \(L\in (0, \infty ), \mu \in (-\infty , 0]\) and let f satisfy the PŁ inequality on \(X=\{x: f(x)\le f(x^1)\}\). Suppose that \(r(t)=L\mu (L+\mu -\mu _p)t^3- \left( L^2-\mu _p (L+\mu )+5 L \mu +\mu ^2\right) t^2+4(L+\mu )t-4\) and \(\bar{t}\) is the unique root of r in \(\left[ \tfrac{1}{L}, \tfrac{3}{\mu +L+\sqrt{\mu ^2-L\mu +L^2}}\right] \) if it exists. Then \(t^\star \) given by

$$\begin{aligned} t^\star = {\left\{ \begin{array}{ll} \bar{t} &{} \text { if } \bar{t} \text { exists}\\ \tfrac{3}{\mu +L+\sqrt{\mu ^2-L\mu +L^2}} &{} \text {otherwise}, \end{array}\right. } \end{aligned}$$

is the optimal step length for Algorithm 1 with respect to the worst-case convergence rate.

Proof

To obtain an optimal step length, we need to solve the optimization problem

$$\begin{aligned} \min _{t\in \left( 0, \tfrac{2}{L}\right) } h(t), \end{aligned}$$

where h is given by

$$\begin{aligned} h(t)= {\left\{ \begin{array}{ll} \left( \frac{\mu _p\left( 1-Lt\right) +\sqrt{\left( L-\mu \right) \left( \mu -\mu _p \right) \left( 2-L t\right) \mu _p t +\left( L-\mu \right) ^2}}{L-\mu +\mu _p}\right) ^2 &{} t\in \left( 0,\tfrac{1}{L}\right) \\ \frac{(Lt-2)(\mu t-2)\mu _p t}{\left( L+\mu -\mu _p\right) t-2}+1 &{} t\in \left[ \tfrac{1}{L}, \tfrac{3}{\mu +L+\sqrt{\mu ^2-L\mu +L^2}}\right] \\ \frac{(L t-1)^2}{(Lt-1)^2+(2-Lt)\mu _p t} &{} t\in \left( \tfrac{3}{\mu +L +\sqrt{\mu ^2-L\mu +L^2}}, \tfrac{2}{L}\right) . \end{array}\right. } \end{aligned}$$

It is easily seen that h is decreasing on \(\left( 0,\tfrac{1}{L}\right) \) and increasing on \(\left( \tfrac{3}{\mu +L+\sqrt{\mu ^2-L\mu +L^2}}, \tfrac{2}{L}\right) \). Hence, we only need to investigate the closed interval \(\left[ \tfrac{1}{L}, \tfrac{3}{\mu +L+\sqrt{\mu ^2-L\mu +L^2}}\right] \). We will show that h is convex on this interval. First, we consider the case \(L+\mu -\mu _p\le 0\). Let \(p(t)=\frac{\mu t-2}{\left( L+\mu -\mu _p\right) t-2}\) and \(q(t)=(Lt-2)\mu _p t\), so that \(h=pq+1\) on the interval. By some algebra, one can show the following inequalities for \(t\in \left[ \tfrac{1}{L}, \tfrac{3}{\mu +L+\sqrt{\mu ^2 -L\mu +L^2}}\right] \):

$$\begin{aligned}&p(t)\ge 0 \quad q(t)\le 0 \\&p^\prime (t)\ge 0 \quad q^\prime (t)\ge 0 \\&p^{\prime \prime }(t)\le 0 \quad q^{\prime \prime }(t)\ge 0. \end{aligned}$$

Hence, the convexity of h follows from \(h^{\prime \prime }=p^{\prime \prime }q+2p^{\prime }q^{\prime }+pq^{\prime \prime }\). Now, we investigate the case that \(L+\mu -\mu _p>0\). Suppose that \(p(t)=\frac{\mu _p t}{\left( L+\mu -\mu _p\right) t-2}\) and \(q(t)=(Lt-2)(\mu t-2)\). For these functions, we have the following inequalities

$$\begin{aligned}&p(t)\le 0 \quad q(t)\ge 0 \\&p^\prime (t)\le 0 \quad q^\prime (t)\le 0 \\&p^{\prime \prime }(t)\ge 0 \quad q^{\prime \prime }(t)\le 0, \end{aligned}$$

from which, as in the former case, one can infer the convexity of h on the given interval. Hence, if r has a root \(\bar{t}\) in \(\left[ \tfrac{1}{L}, \tfrac{3}{\mu +L+\sqrt{\mu ^2-L\mu +L^2}}\right] \), it is the minimizer of h (recall that r is, up to a positive factor, the numerator of \(h^\prime \) on this interval). Otherwise, the point \(t^\star =\tfrac{3}{\mu +L+\sqrt{\mu ^2-L\mu +L^2}}\) is the minimizer. This follows from the fact that \(h^\prime (\tfrac{1}{L})=\tfrac{2L\mu _p(\mu _p-L)}{(L+\mu _p-\mu )^2}\le 0\) and the convexity of h on the interval in question. \(\square \)
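Numerically, the optimal step length of Proposition 1 can be computed by finding the root of the cubic r in the prescribed interval and falling back to its right endpoint; a sketch (the sample values of L, \(\mu \) and \(\mu _p\) below are arbitrary):

```python
import numpy as np

def optimal_step(L, mu, mu_p):
    """Optimal fixed step length from Proposition 1 (numerical sketch)."""
    t_hi = 3.0 / (mu + L + np.sqrt(mu**2 - L*mu + L**2))
    coeffs = [L*mu*(L + mu - mu_p),                       # coefficients of r(t)
              -(L**2 - mu_p*(L + mu) + 5*L*mu + mu**2),
              4*(L + mu),
              -4.0]
    roots = np.roots(coeffs)
    in_interval = [r.real for r in roots
                   if abs(r.imag) < 1e-10 and 1.0/L <= r.real <= t_hi]
    return in_interval[0] if in_interval else t_hi

print(optimal_step(1.0, -0.5, 0.2))
# Convex case mu = 0: the output should match Corollary 1 below.
print(optimal_step(1.0, 0.0, 0.04), min(2/(1.0 + np.sqrt(1.0*0.04)), 3/(2*1.0)))
```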

Thanks to Proposition 1, the following corollary gives the optimal step length for L-smooth convex functions satisfying the PŁ inequality.

Corollary 1

If f is an L-smooth convex function satisfying the PŁ inequality, then the optimal step length with respect to the worst-case convergence rate is \(\min \left\{ \tfrac{2}{L+\sqrt{L\mu _p}}, \tfrac{3}{2L}\right\} \).

The constant \(\tfrac{2}{L+\sqrt{L\mu _p}}\) also appears in the fast gradient algorithm introduced in [17] for L-smooth convex functions which are \((1, \mu _s)\)-quasar-convex; see Definition 4. By Theorem 9, \((1, \mu _s)\)-quasar-convexity implies the PŁ inequality with the same constant. Algorithm 2 describes the method in question.

Algorithm 2 (the fast gradient method of [17]; see [17] for its full description).

One can verify that Algorithm 2, at the first iteration, generates \(x^2=x^1-\tfrac{2}{L+\sqrt{L\mu _p}}\nabla f(x^1)\).

A more general form of the PŁ inequality, called the Łojasiewicz inequality, may be written as

$$\begin{aligned} \left( f(x)-f^\star \right) ^{2\theta } \le \tfrac{1}{2\mu _p} \Vert \nabla f(x)\Vert ^2, \ \ \ \forall x\in X, \end{aligned}$$
(14)

where \(\theta \in (0, 1)\). It is known that when \(\theta \in (0, \tfrac{1}{2}]\), some algorithms, including Algorithm 1, are linearly convergent; see [3, 4]. In the next theorem, we show that for non-constant functions with finite maximum and minimum curvature, the Łojasiewicz inequality cannot hold for \(\theta \in (0, \tfrac{1}{2})\).

Theorem 4

Let \(f\in {\mathcal {F}}_{\mu , L}({\mathbb {R}}^n)\) be a non-constant function. If f satisfies the Łojasiewicz inequality on \(X=\{x: f(x)\le f(x^1)\}\), then \(\theta \ge \tfrac{1}{2}\).

Proof

To the contrary, assume that \(\theta \in (0, \tfrac{1}{2})\). Without loss of generality, we may assume that \(\mu =-L\). It is known that Algorithm 1 generates a decreasing sequence \(\{f(x^k)\}\) and that \(\Vert \nabla f(x^k)\Vert \rightarrow 0\); see [19, page 28]. Furthermore, (14) implies that \(f(x^k)\rightarrow f^\star \). Without loss of generality, we may assume that \(f^\star =0\). First, we investigate the case that \(f(x^1)=1\). The semidefinite programming problem corresponding to performance estimation in this case may be formulated as follows,

$$\begin{aligned} \begin{array}{ll} \max &{} f^2\\ \mathrm{s.t}. &{} \mathrm{tr}(A_1X)-f^2 + 1 \le 0\\ &{} \mathrm{tr}(A_2X)-1 + f^2\le 0 \\ &{} 1+\mathrm{tr}(A_3X)\le 0 \\ &{} (f^2)^{2\theta }+\mathrm{tr}(A_4X)\le 0\\ &{} f^2\ge 0, X\succeq 0. \end{array} \end{aligned}$$
(15)

Since Algorithm 1 is a monotone method, \(f^2\) takes values in [0, 1]. In addition, we have \(f^2\le (f^2)^{2\theta }\) on this interval. Hence, by using Theorem 3, we get the following bound,

$$\begin{aligned} f^2\le \frac{2L-2\mu _p}{2L+\mu _p}. \end{aligned}$$

Now, suppose that \(f(x^1)=f^1>0\). Consider the function \(h: {\mathbb {R}}^n\rightarrow {\mathbb {R}}\) given by \(h(x)=\tfrac{f(x)}{f^1}\). It is seen that h is \(\tfrac{L}{f^1}\)-smooth and

$$\begin{aligned} h(x)^{2\theta } \le \tfrac{1}{2\mu _p(f^1)^{2\theta -2}} \Vert \nabla h(x)\Vert ^2, \ \ \ \forall x\in X. \end{aligned}$$

As Algorithm 1 generates the same \(x^2\) for both functions, by using the first part, we obtain

$$\begin{aligned} \frac{f(x^2)}{f(x^1)}\le \frac{2L(f^1)^{-1}-2\mu _p(f^1)^{2\theta -2}}{2L(f^1)^{-1} +\mu _p(f^1)^{2\theta -2}}=\frac{2L-2\mu _p(f^1)^{2\theta -1}}{2L+\mu _p(f^1)^{2\theta -1}}. \end{aligned}$$

For \(f^1\) sufficiently small, we have \(\frac{2L-2\mu _p(f^1)^{2\theta -1}}{2L+\mu _p(f^1)^{2\theta -1}}< 0\), which contradicts \(f(x^2)\ge f^\star =0\), and the proof is complete. \(\square \)

Necoara et al. gave necessary and sufficient conditions for linear convergence of the gradient method with constant step lengths when f is a smooth convex function; see [17, Theorem 13]. Indeed, the theorem says that Algorithm 1 is linearly convergent if and only if f has a quadratic functional growth on \(\{x: f(x)\le f(x^1)\}\); see Definition 3. However, this theorem does not necessarily hold for non-convex functions. The next theorem provides necessary and sufficient conditions for linear convergence of Algorithm 1.

Theorem 5

Let \(f\in {\mathcal {F}}_{\mu , L}({\mathbb {R}}^n)\). Algorithm 1 is linearly convergent to the optimal value if and only if f satisfies the PŁ inequality on \(\{x: f(x)\le f(x^1)\}\).

Proof

Let \(\bar{x}\in \{x: f(x)\le f(x^1)\}\). Linear convergence implies the existence of \(\gamma \in [0, 1)\) with

$$\begin{aligned} f(\hat{x})-f^\star \le \gamma \left( f(\bar{x})-f^\star \right) , \end{aligned}$$
(16)

where \(\hat{x}=\bar{x}-\tfrac{1}{L}\nabla f(\bar{x})\). By (3), we have \(f(\bar{x})- f(\hat{x})\le \tfrac{2L-\mu }{2L^2}\Vert \nabla f(\bar{x})\Vert ^2\). By using this inequality with (16), we get

$$\begin{aligned} f(\bar{x})-f^\star \le \tfrac{1}{1-\gamma } \left( f(\bar{x})-f(\hat{x})\right) \le \tfrac{2L-\mu }{2L^2(1-\gamma )}\Vert \nabla f(\bar{x})\Vert ^2, \end{aligned}$$

which shows that f satisfies the PŁ inequality on \(\{x: f(x)\le f(x^1)\}\). The other implication follows from Theorem 3. \(\square \)

3 The PŁ inequality: relation to some classes of functions

In this section, we study some classes of functions for which Algorithm 1 may be linearly convergent. We establish that these classes of functions satisfy the PŁ inequality under mild assumptions, and we infer the linear convergence by using Theorem 3. Moreover, one can get convergence rates by applying performance estimation.

Throughout the section, we denote the optimal solution set of problem (1) by \(X^\star \) and we assume that \(X^\star \) is non-empty. We denote the distance function to \(X^\star \) by \(d_{X^\star }(x):=\inf _{y\in X^\star } \Vert y-x\Vert \). The set-valued mapping \(\Pi _{X^\star } (x)\) stands for the projection of x onto \(X^\star \), that is, \(\Pi _{X^\star } (x)=\{y\in X^\star : \Vert y-x\Vert =d_{X^\star }(x)\}\). Note that, as \(X^\star \) is a non-empty closed set, \(\Pi _{X^\star }(x)\) is non-empty and well-defined.

Definition 2

Let \(\mu _g> 0\). A function f has a quadratic gradient growth on \(X\subseteq {\mathbb {R}}^n\) if

$$\begin{aligned} \langle \nabla f(x), x-x^\star \rangle \ge \mu _g d_{X^\star }^2 (x), \ \ \ \forall x\in X, \end{aligned}$$
(17)

for some \(x^\star \in \Pi _{X^\star }(x)\).

Note that inequality (2) implies that \(\mu _g\le L\). Hu et al. [15] investigated the convergence rate of \(\{x^k\}\) when f satisfies (17) and \(X^\star \) is a singleton. To the best of the authors' knowledge, there is no convergence rate result in terms of \(\{f(x^k)\}\) for functions with a quadratic gradient growth. The next proposition states that the quadratic gradient growth property implies the PŁ inequality.

Proposition 2

Let \(f\in {\mathcal {F}}_{\mu , L}({\mathbb {R}}^n)\). If f has a quadratic gradient growth on \(X\subseteq {\mathbb {R}}^n\) with \(\mu _g> 0\), then f satisfies the PŁ inequality with \( \mu _p=\tfrac{\mu _g^2}{L}\).

Proof

Suppose that \(x^\star \in \Pi _{X^\star }(x)\) satisfies (17). By the Cauchy-Schwarz inequality, we have

$$\begin{aligned} \mu _g \Vert x-x^\star \Vert \le \Vert \nabla f(x)\Vert . \end{aligned}$$
(18)

On the other hand, (2) implies that

$$\begin{aligned} f(x)\le f(x^\star )+\tfrac{L}{2} \Vert x-x^\star \Vert ^2. \end{aligned}$$
(19)

Combining (18) and (19) yields \(f(x)-f^\star \le \tfrac{L}{2}\Vert x-x^\star \Vert ^2\le \tfrac{L}{2\mu _g^2}\Vert \nabla f(x)\Vert ^2\), which is the PŁ inequality with \(\mu _p=\tfrac{\mu _g^2}{L}\). \(\square \)

By Proposition 2 and Theorem 3, one can infer the linear convergence of Algorithm 1 when f has a quadratic gradient growth on \(X=\{x: f(x)\le f(x^1)\}\). Indeed, one can derive the following bound if \(t_1=\tfrac{1}{L}\) and \(\mu =-L\),

$$\begin{aligned} f(x^2)-f^\star \le \left( \frac{2L^2-2\mu _g^2}{2L^2+\mu _g^2}\right) \left( f(x^1)-f^\star \right) . \end{aligned}$$
(20)

Nevertheless, by using the performance estimation method, one can derive a better bound than the bound given by (20). The performance estimation problem for \(t_1=\tfrac{1}{L}\) in this case may be formulated as

$$\begin{aligned} \begin{array}{ll} \max &{} \frac{f^{2}-f^\star }{f^1-f^\star }\\ \mathrm{s.t}. &{} \{x^k, g^k, f^k\}\cup \{y^k, 0, f^\star \} \ \text {satisfy interpolation constraints}\ (9) \ \text {for}\ k\in \{1, 2\}\\ &{} x^{2}=x^1-\tfrac{1}{L} g^1 \\ &{} f^k\ge f^\star \ \ k\in \{1, 2\}\\ &{} \langle g^k, x^k-y^k\rangle \ge \mu _g \Vert y^k-x^k\Vert ^2, \ \ k\in \{1, 2\}\\ &{} \Vert x^1-y^1\Vert ^2 \le \Vert x^1-y^2\Vert ^2\\ &{} \Vert x^2-y^2\Vert ^2 \le \Vert x^2-y^1\Vert ^2. \end{array} \end{aligned}$$
(21)

Analogous to Sect. 2, one can obtain an upper bound for problem (21) by solving a semidefinite program. Our numerical results show that the bounds generated by performance estimation are tighter than bound (20); see Fig. 1. We do not have a closed-form bound on the optimal value of (21), though.
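For illustration, a rough CVXPY sketch of problem (21) with \(t_1=1/L\) and \(\mu =-L\) (the value of \(\mu _g\) below is arbitrary; vector quantities are handled through a Gram matrix over the basis \((x^1, y^2, g^1, g^2)\), fixing \(y^1=0\) by translation invariance, and we normalize \(f^\star =0\), \(f^1=1\)):

```python
import numpy as np
import cvxpy as cp

L, mu, mu_g = 1.0, -1.0, 0.3

def vec(x1=0.0, y2=0.0, g1=0.0, g2=0.0):        # coefficients in the basis (x^1, y^2, g^1, g^2)
    return np.array([x1, y2, g1, g2])

x1, y1, y2 = vec(x1=1), vec(), vec(y2=1)
g1, g2 = vec(g1=1), vec(g2=1)
x2 = x1 - g1 / L                                 # x^2 = x^1 - (1/L) g^1

G = cp.Variable((4, 4), PSD=True)                # Gram matrix of the basis vectors
f1, f2, fs = 1.0, cp.Variable(), 0.0

def ip(u, v):                                    # <u, v> expressed through the Gram matrix
    return u @ G @ v

pts = [(x1, g1, f1), (x2, g2, f2), (y1, 0*g1, fs), (y2, 0*g2, fs)]
cons = [f2 >= fs]
for (xi, gi, fi) in pts:                         # interpolation constraints (9)
    for (xj, gj, fj) in pts:
        if xi is xj:
            continue
        lhs = (ip(gi - gj, gi - gj)/L + mu*ip(xi - xj, xi - xj)
               - (2*mu/L)*ip(gj - gi, xj - xi)) / (2*(1 - mu/L))
        cons.append(lhs <= fi - fj - ip(gj, xi - xj))
cons += [ip(g1, x1 - y1) >= mu_g*ip(x1 - y1, x1 - y1),    # quadratic gradient growth
         ip(g2, x2 - y2) >= mu_g*ip(x2 - y2, x2 - y2),
         ip(x1 - y1, x1 - y1) <= ip(x1 - y2, x1 - y2),    # y^k is a nearest optimal point
         ip(x2 - y2, x2 - y2) <= ip(x2 - y1, x2 - y1)]

prob = cp.Problem(cp.Maximize(f2), cons)
prob.solve()
print(prob.value, (2*L**2 - 2*mu_g**2) / (2*L**2 + mu_g**2))   # PEP value vs. bound (20)
```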

Fig. 1: Convergence rate computed by performance estimation (red line) and the bound given by (20) (blue line) for \(\tfrac{\mu _g}{L}\in (0,1)\) (color figure online)

Definition 3

[17, Definition 4], [19, Definition 4.1.2] Let \(\mu _q> 0\). A function f has a quadratic functional growth on \(X\subseteq {\mathbb {R}}^n\) if

$$\begin{aligned} \tfrac{\mu _q}{2}d^2_{X^\star }(x) \le f(x)-f^\star , \ \ \ \forall x\in X. \end{aligned}$$
(22)

It is readily seen that, contrary to the previous situations, the quadratic functional growth property does not necessarily imply that each stationary point is a global optimal solution. The next theorem investigates the relationship between the quadratic functional growth property and the other notions.

Theorem 6

Let \(f\in {\mathcal {F}}_{\mu , L}({\mathbb {R}}^n)\) and let \(X=\{x: f(x)\le f(x^1)\}\). We have the following implications:

  1. i)

    (4) \(\Rightarrow \) (22) with \(\mu _q=\mu _p\).

  2. ii)

    If \(\mu _q>\tfrac{-\mu L}{L-\mu }\), then (22) \(\Rightarrow \) (17) with \(\mu _g=\tfrac{\mu _q}{2} (1-\tfrac{\mu }{L})+\tfrac{\mu }{2}\).

  3. iii)

    If

    $$\begin{aligned} f(x)-f(x^\star )\le \langle \nabla f(x), x-x^\star \rangle , \ \forall x\in X, \end{aligned}$$

    for some \(x^\star \in \Pi _{X^\star } (x)\) then (22) \(\Rightarrow \) (17) with \(\mu _g=\tfrac{\mu _q}{2}\).

Proof

One can establish i) similarly to the proof of [16, Theorem 2]. Consider part ii). Let \(x\in X\) and \(x^\star \in \Pi _{X^\star } (x)\) with \(d_{X^\star } (x)=\Vert x-x^\star \Vert \). By (9), we have

$$\begin{aligned} f(x)-f(x^\star )\le \tfrac{-1}{2(L-\mu )}\Vert \nabla f(x)\Vert ^2 -\tfrac{\mu L}{2(L-\mu )}\Vert x-x^\star \Vert ^2+\tfrac{L}{L-\mu } \langle \nabla f(x), x-x^\star \rangle . \end{aligned}$$

As \(\tfrac{\mu _q}{2}\Vert x-x^\star \Vert ^2 \le f(x)-f(x^\star )\), we get

$$\begin{aligned} \left( \tfrac{\mu _q}{2}\left( 1-\tfrac{\mu }{L}\right) +\tfrac{\mu }{2}\right) \left\| x-x^\star \right\| ^2 \le \langle \nabla f(x), x-x^\star \rangle , \end{aligned}$$

which establishes the desired inequality. Part iii) is proved similarly to the former case. \(\square \)

By Theorem 3 and Proposition 2, it is clear that Algorithm 1 enjoys a linear convergence rate if f has a quadratic functional growth on \(X=\{x: f(x)\le f(x^1)\}\) and f satisfies the assumptions of part ii) or iii) of Theorem 6. For instance, if \(\mu =-L\) and \(\mu _q\in (\tfrac{L}{2}, L)\), one can derive the following convergence rate for Algorithm 1 with fixed step length \(t_k=\tfrac{1}{L}\), \(k\in \{1, ..., N\}\),

$$\begin{aligned} f(x^{N+1})-f^\star \le \left( \frac{2L^2-2(\mu _q-\tfrac{L}{2})^2}{2L^2 +(\mu _q-\tfrac{L}{2})^2}\right) ^N \left( f(x^1)-f^\star \right) . \end{aligned}$$
(23)

It is interesting to compare the convergence rate (23) to the convergence rate obtained by using the performance estimation framework. In this case, the performance estimation problem may be cast as follows,

$$\begin{aligned} \begin{array}{ll} \max &{} \frac{f^{N+1}-f^\star }{f^1-f^\star }\\ \mathrm{s.t}. &{} \{x^k, g^k, f^k\}\cup \{y^k, 0, f^\star \} \ \text {satisfy inequality }\ (9) \ \text {for}\ k\in \{1, ..., N+1\} \\ &{} x^{k+1}=x^k-\tfrac{1}{L} g^k, \ \ k\in \{1, ..., N\} \\ &{} f^k\ge f^\star \ \ k\in \{1, ..., N\}\\ &{} f^k-f^\star \ge \tfrac{\mu _q}{2} \Vert x^k-y^k\Vert ^2, \ \ k\in \{1, ..., N+1\}\\ &{} \Vert x^k-y^k\Vert ^2 \le \Vert x^k-y^{k^\prime }\Vert ^2, \ \ k\in \{1, ..., N+1\}, k^\prime \in \{1, ..., N+1\}. \end{array} \end{aligned}$$
(24)

Since \(x^{k+1}=x^k-\tfrac{1}{L} g^k\), we get \(x^{k+1}=x^1-\tfrac{1}{L} \sum _{l=1}^k g^l\). Hence, problem (24) may be reformulated as follows,

$$\begin{aligned} \begin{array}{ll} \max &{} \frac{f^{N+1}-f^\star }{f^1-f^\star }\\ \mathrm{s.t}. \ &{} \displaystyle \left\{ x^1-\tfrac{1}{L} \sum _{l=1}^{k-1} g^l, g^k, f^k\right\} \cup \{y^k, 0, f^\star \} \ \text {satisfy interpolation constraints}\ (9) \\ &{} f^k\ge f^\star \ \ k\in \{1, ..., N\}\\ &{} f^k-f^\star \ge \tfrac{\mu _q}{2} \left\| x^1-\tfrac{1}{L} \displaystyle \sum _{l=1}^{k-1} g^l-y^k\right\| ^2, \ \ k\in \{1, ..., N+1\}\\ &{} \left\| x^1-\tfrac{1}{L} \displaystyle \sum _{l=1}^{k-1} g^l-y^k\right\| ^2 \le \left\| x^1-\tfrac{1}{L} \displaystyle \sum _{l=1}^{k-1} g^l-y^{k^\prime }\right\| ^2, \ \ k, k^\prime \in \{1, ..., N+1\}. \end{array} \end{aligned}$$
(25)

The next theorem provides an upper bound for problem (25) by using weak duality.

Theorem 7

Let \(f\in {\mathcal {F}}_{-L, L}({\mathbb {R}}^n)\) and let f have a quadratic functional growth on \(X=\{x: f(x)\le f(x^1)\}\) with \(\mu _q\in (\tfrac{L}{2}, L)\). If \(t_k=\tfrac{1}{L}\), \(k\in \{1, ..., N\}\), then we have the following convergence rate for Algorithm 1,

$$\begin{aligned} f(x^{N+1})-f^\star \le \tfrac{L}{\mu _q} \left( 2-\tfrac{2\mu _q}{L}\right) ^N \left( f(x^1)-f^\star \right) . \end{aligned}$$
(26)

Proof

The proof is analogous to that of Theorem 3. Without loss of generality, we may assume that \(f^\star =0\). By some algebra, one can show that

$$\begin{aligned}&{f^{N+1}-f^\star }-\tfrac{L}{\mu _q} \left( 2-\tfrac{2\mu _q}{L}\right) ^N \left( {f^1-f^\star } \right) +\sum _{j=1}^{N+1} \left( 2^{N+1-j} \left( 1-\tfrac{\mu _q}{L}\right) ^{N-1}\right) \\&\quad \times \left( f^\star -f^j-\left\langle g^j, y^1-x^1 +\tfrac{1}{L} \sum _{l=1}^{j-1} g^l\right\rangle -\tfrac{1}{2L}\Vert g^j\Vert ^2 +\tfrac{L}{4} \left\| y^1-x^1+\tfrac{1}{L} \sum _{l=1}^{j-1} g^l +\tfrac{1}{L} g^j\right\| ^2 \right) \\&\quad + \sum _{i=2}^{N}\sum _{j=i}^{N+1} \left( 2^{N+1-j} \left( \tfrac{\mu _q}{L}\right) \left( 1-\tfrac{\mu _q}{L}\right) ^{N-i}\right) \left( f^\star -f^j-\langle g^j, y^i-x^1+\tfrac{1}{L} \sum _{l=1}^{j-1} g^l\rangle \right. \\&\quad \left. - \tfrac{1}{2L}\Vert g^j\Vert ^2+\tfrac{L}{4} \left\| y^i-x^1+\tfrac{1}{L} \sum _{l=1}^{j-1} g^l+\tfrac{1}{L} g^j\right\| ^2 \right) +\sum _{j=2}^{N} \left( 2^{N+1-j}(1-\tfrac{\mu _q}{L})^{N-j}\right) \\&\quad \times \left( f^j-f^\star -\tfrac{\mu _q}{2} \left\| y^j-x^1+\tfrac{1}{L} \sum _{l=1}^{j-1} g^l\right\| ^2 \right) +\left( 2^{N}\left( 1-\tfrac{\mu _q}{L}\right) ^{N-1} +\tfrac{L}{\mu _q} \left( 2-\tfrac{2\mu _q}{L}\right) ^N\right) \\&\quad \times \left( f^{1}-f^\star -\tfrac{\mu _q}{2} \Vert y^{1}-x^1\Vert ^2 \right) =-\left( \tfrac{L}{4}\left( 1-\tfrac{\mu _q}{L}\right) ^{N-1}\Vert y^1-x^1+\tfrac{1}{L} \sum _{l=1}^{N+1} g^l\Vert ^2\right) \\&\quad - \sum _{i=2}^N\left( \tfrac{\mu _q}{4}\left( 1-\tfrac{\mu _q}{L}\right) ^{N-i} \left\| y^i -x^1+\tfrac{1}{L}\sum _{l=1}^{N+1} g^l\right\| ^2\right) \le 0. \end{aligned}$$

By using the above inequality, we get

$$\begin{aligned} f^{N+1}-f^\star \le \tfrac{L}{\mu _q} \left( 2-\tfrac{2\mu _q}{L}\right) ^N \left( {f^1-f^\star } \right) , \end{aligned}$$

for any feasible point of (25), and the proof is complete. \(\square \)

By doing some calculus, one can verify the following inequality

$$\begin{aligned} \frac{2L^2-2\left( \mu _q-\tfrac{L}{2}\right) ^2}{2L^2+\left( \mu _q-\tfrac{L}{2}\right) ^2} \ge \left( 2-\tfrac{2\mu _q}{L}\right) , \ \ \mu _q\in \left( \tfrac{L}{2}, L\right) . \end{aligned}$$

Hence, the per-iteration factor in Theorem 7 is smaller than that of (23); in particular, Theorem 7 provides a tighter bound than (23) for N sufficiently large.
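A quick numerical spot check of the displayed inequality over a grid of \(\mu _q\in (\tfrac{L}{2}, L)\):

```python
import numpy as np

L = 1.0
for mu_q in np.linspace(0.51, 0.99, 49):
    lhs = (2*L**2 - 2*(mu_q - L/2)**2) / (2*L**2 + (mu_q - L/2)**2)
    assert lhs >= 2 - 2*mu_q/L
```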

Definition 4

[14, Definition 1] Let \(\gamma \in (0, 1]\) and \(\mu _s\ge 0\). A function f is called \((\gamma , \mu _s)\)-quasar-convex on \(X\subseteq {\mathbb {R}}^n\) with respect to \(x^\star \in \text {argmin}_{x\in {\mathbb {R}}^n} f(x)\) if

$$\begin{aligned} f(x)+\tfrac{1}{\gamma }\langle \nabla f(x), x^\star -x\rangle +\tfrac{\mu _s}{2}\Vert x^\star -x\Vert ^2 \le f^\star , \ \ \ \forall x\in X. \end{aligned}$$
(27)

The class of quasar-convex functions is large. For instance, non-negative homogeneous functions are (1, 0)-quasar-convex on \({\mathbb {R}}^n\). (Recall that a function \(f:{\mathbb {R}}^n\rightarrow {\mathbb {R}}\) is called homogeneous of degree k if \(f(\alpha x)=\alpha ^k f(x)\) for all \(x\in {\mathbb {R}}^n, \alpha \in {\mathbb {R}}\). ) Indeed, if f is non-negative homogeneous of degree \(k\ge 1\), by the Euler identity, we have

$$\begin{aligned} f(x)+\langle \nabla f(x), x^\star -x\rangle =(1-k) f(x) \le 0, \ \forall x\in {\mathbb {R}}^n, \end{aligned}$$

where \(x^\star =0\). In what follows, we list some convergence results concerning quasar-convex functions for Algorithm 1.
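Before that, here is a small numerical sanity check of the homogeneous-function example (with the arbitrary choice \(f(x)=\Vert x\Vert ^4\), which is non-negative and homogeneous of degree 4, so that (27) holds with \(\gamma =1\), \(\mu _s=0\) and \(x^\star =0\)):

```python
import numpy as np

def f(x):       return np.dot(x, x)**2          # non-negative, homogeneous of degree 4
def grad_f(x):  return 4.0*np.dot(x, x)*x

rng = np.random.default_rng(1)
for x in rng.normal(size=(100, 3)):
    assert f(x) + np.dot(grad_f(x), -x) <= 1e-9   # (27) with gamma=1, mu_s=0, x*=0, f*=0
```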

Theorem 8

[5, Remark 4.3] Let f be L-smooth and let f be \((\gamma , \mu _s)\)-quasar-convex on \(X=\{x: f(x)\le f(x^1)\}\). If \(t_1=\tfrac{1}{L}\) and \(x^2\) is generated by Algorithm 1, then

$$\begin{aligned} f(x^2)-f^\star \le \left( 1-\tfrac{\gamma ^2\mu _s}{L}\right) \left( f(x^1)-f^\star \right) . \end{aligned}$$
(28)

In the following theorem, we state the relationship between quasar-convexity and other concepts. Before we get to the theorem, we recall star convexity. A set X is called star convex at \(x^\star \) if

$$\begin{aligned} \lambda x+(1-\lambda )x^\star \in X, \ \forall x\in X, \forall \lambda \in [0, 1]. \end{aligned}$$

Theorem 9

Let \(x^\star \) be the unique solution of problem (1) and let \(X=\{x: f(x)\le f(x^1)\}\). If X is star convex at \(x^\star \), then we have the following implications:

  1. i)

    (27) \(\Rightarrow \) (17) with \(\mu _g=\tfrac{\mu _s\gamma }{2}+\tfrac{\mu _s\gamma ^2}{4}\).

  2. ii)

    (17) \(\Rightarrow \) (27) with \(\mu _s=\ell -\tfrac{L}{2}\) and \(\gamma =\tfrac{\mu _g}{\ell }\) for each \(\ell \in (\max (\tfrac{L}{2}, \mu _g), \infty )\).

  3. iii)

    (27) \(\Rightarrow \) (4) with \(\mu _p=\mu _s\gamma ^2\).

Proof

The proof of i) is similar in spirit to the proof of Theorem 1 in [17]. Let \(x\in X\). By the fundamental theorem of calculus and (27), we have

$$\begin{aligned} f(x)-f(x^\star )&=\int _0^1 \tfrac{1}{\lambda }\langle \nabla f (\lambda x+(1-\lambda )x^\star ), \lambda x+(1-\lambda )x^\star -x^\star \rangle d\lambda \\&\ge \int _0^1 \tfrac{\gamma }{\lambda }\left( f(\lambda x +(1-\lambda )x^\star )-f(x^\star )+\tfrac{\mu _s\lambda ^2}{2} \Vert x-x^\star \Vert ^2\right) d\lambda \\&\ge \tfrac{\gamma \mu _s}{4}\Vert x-x^\star \Vert ^2, \end{aligned}$$

where the last inequality follows from the global optimality of \(x^\star \). By summing \(f(x)-f(x^\star )\ge \tfrac{\gamma \mu _s}{4}\Vert x-x^\star \Vert ^2\) and (27), we get the desired inequality. Now, we prove part ii). Let \(x\in X\) and \(\ell \in (\max (\tfrac{L}{2}, \mu _g), \infty )\). By (2), we have

$$\begin{aligned} f(x)\le f(x^\star )+\tfrac{L}{2} \Vert x-x^\star \Vert ^2. \end{aligned}$$
(29)

By using (29) and (17), we get

$$\begin{aligned} f(x)+\left( \tfrac{\ell }{\mu _g}\right) \langle \nabla f(x), x^\star -x\rangle +\left( \ell -\tfrac{L}{2}\right) \Vert x-x^\star \Vert ^2 \le f(x^\star ). \end{aligned}$$

For the proof of iii), we refer the reader to [5, Lemma 3.2]. \(\square \)

By combining Theorem 3 and Theorem 9, under the assumptions of Theorem 8, one can get the following convergence rate for Algorithm 1 with \(t_1=\tfrac{1}{L}\),

$$\begin{aligned} f(x^2)-f^\star \le \left( \frac{2L-2\mu _s\gamma ^2}{2L+\mu _s\gamma ^2}\right) \left( f(x^1)-f^\star \right) , \end{aligned}$$

which is tighter than the bound given in Theorem 8.
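A one-line numerical check of this comparison (writing \(a=\mu _s\gamma ^2\in (0, L]\)):

```python
import numpy as np

L = 1.0
for a in np.linspace(0.01, 1.0, 100):
    assert (2*L - 2*a) / (2*L + a) <= 1 - a/L + 1e-12
```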

4 Concluding remarks

In this paper we studied the convergence rate of the gradient method with fixed step lengths for smooth functions satisfying the PŁ inequality. We gave a new convergence rate, which is sharper than known bounds in the literature. One important question which remains to be addressed is the computation of the tightest bound for Algorithm 1. Moreover, the performance analysis of fast gradient methods, like Algorithm 2, for these classes of functions may also be of interest.

We only studied linear convergence in terms of the objective values. However, one can also infer linear convergence in terms of the distance to the solution set or the norm of the gradient by using our results. For instance, under the assumptions of Theorem 3, we have

$$\begin{aligned} \tfrac{\mu _p}{2} d^2_{X^\star }(x^{k+1})\le f(x^{k+1})-f^\star \le \gamma ^k \left( f(x^1)-f^\star \right) \le \tfrac{L\gamma ^k }{2} d^2_{X^\star }(x^1), \end{aligned}$$

where the first inequality follows from Theorem 6, \(\gamma \) is the linear convergence rate given in Theorem 3, and the last inequality follows from (2). Hence,

$$\begin{aligned} d^2_{X^\star }(x^{k+1})\le \tfrac{L\gamma ^k }{\mu _p} d^2_{X^\star }(x^1). \end{aligned}$$

Moreover, the quadratic gradient growth property is a necessary and sufficient condition for linear convergence in terms of the distance to the solution set; see [24, Theorem 3.4]. Note that the PŁ inequality and the quadratic gradient growth property are equivalent.