1 Introduction

Fitting most machine learning models involves solving some sort of optimization problem. Gradient descent, and variants of it like coordinate descent and stochastic gradient, are the workhorse tools used by the field to solve very large instances of these problems. In this work we consider the basic problem of minimizing a smooth function, and the convergence rates that gradient descent methods achieve for this problem. It is well-known that if f is strongly-convex, then gradient descent achieves a global linear convergence rate for this problem [28]. However, many of the fundamental models in machine learning, like least squares and logistic regression, yield objective functions that are convex but not strongly-convex. Further, if f is only convex, then gradient descent achieves only a sub-linear rate.

This situation has motivated a variety of alternatives to strong convexity (SC) in the literature, in order to show that we can obtain linear convergence rates for problems like least squares and logistic regression. One of the oldest of these conditions is the error bound (EB) property of Luo and Tseng [22], while four more recently-considered conditions are essential strong convexity (ESC) [20], weak strong convexity (WSC) [25], the restricted secant inequality (RSI) [45], and the quadratic growth (QG) condition [2]. Some of these conditions have different names in the special case of convex functions: a convex function satisfying RSI is said to satisfy restricted strong convexity (RSC) [45], while a convex function satisfying QG is said to satisfy optimal strong convexity (OSC) [19] or (confusingly) WSC [23]. The proofs of linear convergence under all of these relaxations are typically not straightforward, and it is rarely discussed how these conditions relate to each other.

In this work, we consider a much older condition that we refer to as the Polyak-Łojasiewicz (PL) inequality. This inequality was originally introduced by Polyak [31], who showed that it is a sufficient condition for gradient descent to achieve a linear convergence rate. We describe it as the PL inequality because it is also a special case of the inequality introduced in the same year by Łojasiewicz [21]. We review the PL inequality in the next section and show how it leads to a trivial proof of the linear convergence rate of gradient descent. Next, in terms of showing a global linear convergence rate to the optimal solution, we show that the PL inequality is weaker than all of the more recent conditions discussed in the previous paragraph. This suggests that we can replace the long and complicated proofs under any of the conditions above with simpler proofs based on the PL inequality. Subsequently, we show how this result implies that gradient descent achieves linear rates for standard problems in machine learning, like least squares and logistic regression, that are not necessarily SC, and even for some non-convex problems (Sect. 2.3). In Sect. 3 we use the PL inequality to give new convergence rates for randomized and greedy coordinate descent (implying a new convergence rate for certain variants of boosting), sign-based gradient descent methods, and stochastic gradient methods in either the classical or variance-reduced setting. Next we turn to the closely-related problem of minimizing the sum of a smooth function and a simple non-smooth function. We propose a generalization of the PL inequality that allows us to show linear convergence rates for proximal-gradient methods without SC. This leads to a simple analysis showing linear convergence of methods for training support vector machines. It also implies that we obtain a linear convergence rate for \(\ell _1\)-regularized least squares problems, showing that the extra conditions previously assumed to derive linear convergence rates in this setting are in fact not needed.

2 Polyak-Łojasiewicz Inequality

We first focus on the basic unconstrained optimization problem

$$\begin{aligned} \mathop {\hbox {argmin}}_{x \in \mathbb {R}^d}~\! f(x), \end{aligned}$$
(1)

and we assume that the gradient of f is L-Lipschitz continuous, which implies that

$$\begin{aligned} f(y) \le f(x) + \langle \nabla f(x) , y -x\rangle + \frac{L }{ 2} || y - x ||^2, \end{aligned}$$
(2)

for all x and y. For twice-differentiable objectives this assumption means that the eigenvalues of \(\nabla ^2 f(x)\) are bounded above by some L, which is typically a reasonable assumption. We also assume that the optimization problem has a non-empty solution set \(\mathcal {X}^*\), and we use \(f^*\) to denote the corresponding optimal function value. We will say that a function satisfies the PL inequality if the following holds for some \(\mu > 0\),

$$\begin{aligned} \frac{1}{2}|| \nabla f(x) ||^2 \ge \mu (f(x) - f^*), \quad \forall ~x. \end{aligned}$$
(3)

This inequality simply requires that the gradient grows faster than a quadratic function as we move away from the optimal function value. Note that this inequality implies that every stationary point is a global minimum. But unlike SC, it does not imply that there is a unique solution. Linear convergence of gradient descent under these assumptions was first proved by Polyak [31]. Below we give a simple proof of this result when using a step-size of 1/L.

Theorem 1

Consider problem (1), where f has an L-Lipschitz continuous gradient (2), a non-empty solution set \(\mathcal {X}^*\), and satisfies the PL inequality (3). Then the gradient method with a step-size of 1/L,

$$\begin{aligned} x_{k+1} = x_k - \frac{1 }{ L} \nabla f(x_k), \end{aligned}$$
(4)

has a global linear convergence rate,

$$\begin{aligned} f(x_k) - f^* \le \left( 1- \frac{\mu }{ L}\right) ^k (f(x_0) - f^*). \end{aligned}$$

Proof

By using update rule (4) in the Lipschitz inequality condition (2) we have

$$\begin{aligned} f(x_{k+1}) -f(x_k) \le - \frac{1 }{ 2L} ||\nabla f(x_k)||^2. \end{aligned}$$

Now by using the PL inequality (3) we get

$$\begin{aligned} f(x_{k+1}) -f(x_k) \le - \frac{\mu }{ L} (f(x_k) - f^*). \end{aligned}$$

Re-arranging and subtracting \(f^*\) from both sides gives us \(f(x_{k+1}) - f^* \le \left( 1- \frac{\mu }{ L}\right) (f(x_k) - f^*)\). Applying this inequality recursively gives the result.    \(\square \)

Note that the above result also holds if we use the optimal step-size at each iteration, because of the inequality

$$\begin{aligned} \min _{\alpha }f(x_k - \alpha \nabla f(x_k)) \le f\left( x_k - \frac{1}{L}\nabla f(x_k)\right) . \end{aligned}$$

A beautiful aspect of this proof is its simplicity; in fact it is simpler than the proof of the same fact under the usual SC assumption. It is certainly simpler than typical proofs which rely on the other conditions mentioned in Sect. 1. Further, it is worth noting that the proof does not assume convexity of f. Thus, this is one of the few general results we have for global linear convergence on non-convex problems.
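
To make the rate in Theorem 1 concrete, the following minimal sketch (not part of the original analysis) runs update (4) on a least-squares objective with a rank-deficient design matrix, so that f is convex and satisfies the PL inequality but is not SC. The choices of L as the largest eigenvalue of \(A^TA\) and \(\mu \) as the smallest non-zero eigenvalue of \(A^TA\) are assumptions that anticipate the “strongly-convex composed with linear” discussion in Sect. 2.3.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 50, 20, 5                       # rank r < d, so f is convex but not strongly convex
A = rng.standard_normal((n, r)) @ rng.standard_normal((r, d))
b = A @ rng.standard_normal(d)            # b lies in the range of A, so f* = 0 is attained

f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
grad = lambda x: A.T @ (A @ x - b)

eigs = np.linalg.eigvalsh(A.T @ A)
L = eigs[-1]                              # Lipschitz constant of the gradient
mu = eigs[eigs > 1e-10][0]                # assumed PL constant: smallest non-zero eigenvalue
f_star = 0.0

x = np.zeros(d)
gap0 = f(x) - f_star
for k in range(200):
    x = x - grad(x) / L                   # update (4)
    bound = (1 - mu / L) ** (k + 1) * gap0
    assert f(x) - f_star <= bound + 1e-9  # observed gap respects the rate of Theorem 1
print(f"final gap: {f(x) - f_star:.3e}")
```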

2.1 Relationships Between Conditions

As mentioned in Sect. 1, several other assumptions have been explored over the last 25 years in order to show that gradient descent achieves a linear convergence rate. These typically assume that f is convex, and lead to more complicated proofs than the one above. However, it is rarely discussed how the conditions relate to each other. Indeed, all of the relationships that have been explored have only been in the context of convex functions [19, 25, 44]. In Appendix 2.1, we give the precise definitions of all conditions and also prove the result below giving relationships between the conditions.

Theorem 2

For a function f with a Lipschitz-continuous gradient, the following implications hold:

$$\begin{aligned} (SC) \rightarrow (ESC) \rightarrow (WSC) \rightarrow (RSI) \rightarrow (EB) \equiv (PL) \rightarrow (QG). \end{aligned}$$

If we further assume that f is convex then we have

$$\begin{aligned} (RSI) \equiv (EB) \equiv (PL) \equiv (QG). \end{aligned}$$

This result shows that (QG) is the weakest assumption among those considered. However, QG allows non-global local minima so it is not enough to guarantee that gradient descent finds a global minimizer. This means that, among those considered above, PL and the equivalent EB are the most general conditions that allow linear convergence to a global minimizer. Note that in the convex case QG is called OSC, but the result above shows that in the convex case it is also equivalent to EB and PL (as well as RSI which is known as RSC in this case).

2.2 Invex and Non-convex Functions

While the PL inequality does not imply convexity of f, it does imply the weaker condition of invexity. Invexity was first introduced by Hanson in 1981 [12], and has been used in the context of learning output kernels [8]. Craven and Glover [7] show that a smooth f is invex if and only if every stationary point of f is a global minimum. Since the PL inequality implies that all stationary points are global minimizers, functions satisfying the PL inequality must be invex. Indeed, Theorem 2 shows that all of the previous conditions except (QG) imply invexity. The function \(f(x) = x^2 + 3\sin ^2(x)\) is an example of an invex but non-convex function satisfying the PL inequality (with \(\mu = 1/32\)). Thus, Theorem 1 implies gradient descent obtains a global linear convergence rate on this function.
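
The claimed constant for this example can be checked numerically. The sketch below (an illustration, not a proof) evaluates the ratio of the two sides of (3) on a grid; the printed minimum should be no smaller than \(1/32 \approx 0.031\).

```python
import numpy as np

f = lambda x: x ** 2 + 3 * np.sin(x) ** 2
grad = lambda x: 2 * x + 3 * np.sin(2 * x)     # derivative of 3*sin^2(x) is 3*sin(2x)
f_star = 0.0                                   # the global minimum is at x = 0

x = np.linspace(-10, 10, 200001)
x = x[np.abs(x) > 1e-8]                        # avoid 0/0 at the minimizer
ratio = 0.5 * grad(x) ** 2 / (f(x) - f_star)   # left side of (3) divided by f(x) - f*
print(ratio.min())                             # empirical PL constant; should be >= 1/32
```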

Unfortunately, many complicated models have non-optimal stationary points. For example, typical deep feed-forward neural networks have sub-optimal stationary points and are thus not invex. A classic way to analyze functions like this is to consider a global convergence phase and a local convergence phase. The global convergence phase is the time spent to get “close” to a local minimum, and then once we are “close” to a local minimum the local convergence phase characterizes the convergence rate of the method. Usually, the local convergence phase starts to apply once we are locally SC around the minimizer. But this means that the local convergence phase may be arbitrarily small: for example, for \(f(x) = x^2 + 3\sin ^2(x)\) the local convergence rate would not even apply over the interval \(x \in [-1,1]\). If we instead defined the local convergence phase in terms of locally satisfying the PL inequality, then we see that it can be much larger (\(x \in \mathbb {R}\) for this example).

2.3 Relevant Problems

If f is \(\mu \)-SC, then it also satisfies the PL inequality with the same \(\mu \) (see Appendix 2.3). Further, by Theorem 2, f satisfies the PL inequality if it satisfies any of ESC, WSC, RSI, or EB (while for convex f, QG is also sufficient). Although it is hard to precisely characterize the general class of functions for which the PL inequality is satisfied, we note one important special case below.

Strongly-convex composed with linear: This is the case where f has the form \(f(x) = g(Ax)\) for some \(\sigma \)-SC function g and some matrix A. In Appendix 2.3, we show that this class of functions satisfies the PL inequality, and we note that this form frequently arises in machine learning. For example, least squares problems have the form

$$\begin{aligned} f(x) = \Vert Ax-b\Vert ^2, \end{aligned}$$

and by noting that \(g(z) \triangleq \Vert z-b\Vert ^2\) is SC we see that least squares falls into this category. Indeed, this class includes all convex quadratic functions.

In the case of logistic regression we have

$$\begin{aligned} f(x) = \sum _{i=1}^n \log (1 + \exp (-b_ia_i^Tx)). \end{aligned}$$

This can be written in the form g(Ax), where g is strictly convex but not SC. In cases like this where g is only strictly convex, the PL inequality will still be satisfied over any compact set. Thus, if the iterations of gradient descent remain bounded, the linear convergence result still applies. It is reasonable to assume that the iterates remain bounded when the set of solutions is finite, since each step must decrease the objective function. Thus, for practical purposes, we can relax the above condition to “strictly-convex composed with linear” and the PL inequality implies a linear convergence rate for logistic regression.
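
As an illustration of this discussion (a sketch only, not from the text), the code below runs gradient descent on a synthetic logistic regression problem with step-size 1/L, where \(L = \tfrac{1}{4}\Vert A\Vert _2^2\) is the standard Lipschitz bound for the logistic loss; the suboptimality, measured against the best value seen, shrinks roughly geometrically, as the argument above suggests.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 10
A = rng.standard_normal((n, d))
b = np.sign(rng.standard_normal(n))                  # labels in {-1, +1}

def f(x):
    return np.sum(np.logaddexp(0.0, -b * (A @ x)))   # logistic loss

def grad(x):
    p = 0.5 * (1.0 - np.tanh(0.5 * b * (A @ x)))     # sigmoid(-b_i a_i^T x), stable form
    return A.T @ (-b * p)

L = 0.25 * np.linalg.norm(A, 2) ** 2                 # standard Lipschitz bound for the logistic loss

x = np.zeros(d)
vals = [f(x)]
for _ in range(2000):
    x = x - grad(x) / L
    vals.append(f(x))

gaps = np.array(vals) - vals[-1]                     # proxy for f(x_k) - f*
print(gaps[[0, 250, 500, 1000]])                     # decreases roughly geometrically
```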

3 Convergence of Huge-Scale Methods

In this section, we use the PL inequality to analyze several variants of two of the most widely-used techniques for handling large-scale machine learning problems: coordinate descent and stochastic gradient methods. In particular, the PL inequality yields very simple analyses of these methods that apply to more general classes of functions than previously analyzed. We also note that the PL inequality has recently been used by Garber and Hazan [9] to analyze the Frank-Wolfe algorithm. Further, inspired by the resilient backpropagation (RPROP) algorithm of Riedmiller and Braun [32], in Appendix 3 we also give the first convergence rate analysis for sign-based gradient descent methods.

3.1 Randomized Coordinate Descent

Nesterov [29] shows that randomized coordinate descent achieves a faster convergence rate than gradient descent for problems where we have d variables and it is d times cheaper to update one coordinate than it is to compute the entire gradient. The expected linear convergence rates in this previous work rely on SC, but in this section we show that randomized coordinate descent achieves an expected linear convergence rate if we only assume that the PL inequality holds.

To analyze coordinate descent methods, we assume that the gradient is coordinate-wise Lipschitz continuous, meaning that

$$\begin{aligned} f(x + \alpha e_i) \le f(x) + \alpha \nabla _i f(x) + \frac{L }{ 2} \alpha ^2, \quad \forall \alpha \in \mathbb {R}, \quad \forall x \in \mathbb {R}^d, \end{aligned}$$
(5)

for any coordinate i, and where \(e_i\) is the ith unit vector.

Theorem 3

Consider problem (1), where f has a coordinate-wise L-Lipschitz continuous gradient (5), a non-empty solution set \(\mathcal {X}^*\), and satisfies the PL inequality (3). Consider the coordinate descent method with a step-size of 1/L,

$$\begin{aligned} x_{k+1} = x_k - \frac{1 }{ L} \nabla _{i_k} f(x_k) e_{i_k}. \end{aligned}$$
(6)

If we choose the variable to update \(i_k\) uniformly at random, then the algorithm has an expected linear convergence rate of

$$\begin{aligned} \mathbb {E}[ f(x_k) - f^*] \le \left( 1 - \frac{\mu }{ dL}\right) ^k[ f(x_0) - f^*]. \end{aligned}$$

Proof

By using the update rule (6) in the Lipschitz condition (5) we have

$$\begin{aligned} f(x_{k+1}) \le f(x_k) - \frac{1 }{ 2L} || \nabla _{i_k} f(x_k) ||^2. \end{aligned}$$

By taking the expectation of both sides with respect to \(i_k\) we have

$$\begin{aligned} \mathbb {E} \left[ f(x_{k+1}) \right] &\le f(x_k) - \frac{1 }{ 2L} \mathbb {E} \left[ || \nabla _{i_k} f(x_k) ||^2 \right] \\ &\le f(x_k) - \frac{1 }{ 2L} \sum _i \frac{1}{ d}|| \nabla _{i} f(x_k) ||^2 \\ &= f(x_k) - \frac{1 }{ 2dL} ||\nabla f(x_k) ||^2. \end{aligned}$$

By using the PL inequality (3) and subtracting \(f^*\) from both sides, we get

$$\begin{aligned} \mathbb {E}[ f(x_{k+1}) - f^*] \le \left( 1 - \frac{\mu }{ dL}\right) [f(x_k) - f^*]. \end{aligned}$$

Applying this recursively and using iterated expectations yields the result.    \(\square \)

As before, instead of using 1/L we could perform exact coordinate optimization and the result would still hold. If we have a Lipschitz constant \(L_i\) for each coordinate and sample proportional to the \(L_i\) as suggested by Nesterov [29], then the above argument (using a step-size of \(1/L_{i_k}\)) can be used to show that we obtain a faster rate of

$$\begin{aligned} \mathbb {E}[ f(x_{k}) - f^*] \le \left( 1 - \frac{\mu }{ d\bar{L}}\right) ^k[f(x_0) - f^*], \end{aligned}$$

where \(\bar{L} = \frac{1}{d}\sum _{j=1}^d L_j\).
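
A minimal sketch of update (6) with uniform sampling is given below (an illustration, not from the text). The single constant \(L = \max _i (A^TA)_{ii}\) used here is an assumed choice that satisfies (5) for every coordinate of a least-squares objective.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, r = 50, 20, 5
A = rng.standard_normal((n, r)) @ rng.standard_normal((r, d))   # rank-deficient, so f is not SC
b = A @ rng.standard_normal(d)                                  # f* = 0 by construction

f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
L = np.sum(A ** 2, axis=0).max()          # max_i (A^T A)_{ii}: coordinate-wise Lipschitz constant

x = np.zeros(d)
for k in range(20000):
    i = rng.integers(d)                   # uniform coordinate selection
    g_i = A[:, i] @ (A @ x - b)           # i-th partial derivative
    x[i] -= g_i / L                       # update (6)
print(f(x))                               # approaches f* = 0
```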

3.2 Greedy Coordinate Descent

Nutini et al. [30] have recently analyzed coordinate descent under the greedy Gauss-Southwell (GS) rule, and argued that this rule may be suitable for problems with a large degree of sparsity. The GS rule chooses \(i_k\) according to the rule \(i_k = \hbox {argmax}_j |\nabla _j f(x_k)|\). Using the fact that

$$\begin{aligned} \max _i |\nabla _i f(x_k)|^2 = \Vert \nabla f(x_k)\Vert _\infty ^2 \ge \frac{1}{d} \Vert \nabla f(x_k)\Vert ^2, \end{aligned}$$

it is straightforward to show that the GS rule satisfies the rate above for the randomized method.

However, Nutini et al. [30] show that a faster convergence rate can be obtained for the GS rule by measuring SC in the 1-norm. Since the PL inequality is defined on the dual (gradient) space, in order to derive an analogous result we could measure the PL inequality in the \(\infty \)-norm,

$$\begin{aligned} \Vert \nabla f(x)\Vert _\infty ^2 \ge 2\mu _1(f(x) - f^*). \end{aligned}$$

Because of the equivalence between norms, this does not introduce any additional assumptions beyond the PL inequality itself. Further, if f is \(\mu _1\)-SC in the 1-norm, then it satisfies the PL inequality in the \(\infty \)-norm with the same constant \(\mu _1\). By using that \(|\nabla _{i_k}f(x_k)| = \Vert \nabla f(x_k)\Vert _\infty \) when the GS rule is used, the above argument can be used to show that coordinate descent with the GS rule achieves a convergence rate of

$$\begin{aligned} f(x_k) - f^* \le \left( 1 - \frac{\mu _1}{L}\right) ^k[f(x_0) - f^*], \end{aligned}$$

when the function satisfies the PL inequality in the \(\infty \)-norm with a constant of \(\mu _1\). By the equivalence between norms we have that \(\mu /d \le \mu _1\), so this is faster than the rate with random selection.
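
In code, the only change relative to the randomized sketch above is the selection rule, as in the following illustrative sketch (which recomputes the full gradient at every iteration purely for clarity; efficient implementations exploit problem sparsity, as discussed by Nutini et al. [30]).

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, r = 50, 20, 5
A = rng.standard_normal((n, r)) @ rng.standard_normal((r, d))
b = A @ rng.standard_normal(d)

f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
L = np.sum(A ** 2, axis=0).max()          # coordinate-wise Lipschitz constant, as before

x = np.zeros(d)
for k in range(5000):
    g = A.T @ (A @ x - b)                 # full gradient (only its largest entry is used)
    i = np.argmax(np.abs(g))              # Gauss-Southwell rule
    x[i] -= g[i] / L
print(f(x))
```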

Meir and Rätsch [24] show that we can view some variants of boosting algorithms as implementations of coordinate descent with the GS rule. They use the error bound property to argue that these methods achieve a linear convergence rate, but this property does not lead to an explicit rate. Our simple result above thus provides the first explicit convergence rate for these variants of boosting.

3.3 Stochastic Gradient Methods

Stochastic gradient (SG) methods apply to the general stochastic optimization problem

$$\begin{aligned} \mathop {\hbox {argmin}}_{x \in \mathbb {R}^d}~\! f(x) = \mathbb {E}[f_i(x)], \end{aligned}$$
(7)

where the expectation is taken with respect to i. These methods are typically used to optimize finite sums,

$$\begin{aligned} f(x) = \frac{1 }{ n}\sum _{i=1}^n f_i(x). \end{aligned}$$
(8)

Here, each \(f_i\) typically represents the fit of a model on an individual training example. SG methods are suitable for cases where the number of training examples n is so large that it is infeasible to compute the gradient of all n examples more than a few times.

SG methods use the iteration

$$\begin{aligned} x_{k+1} = x_k - \alpha _k \nabla f_{i_k}(x_k), \end{aligned}$$
(9)

where \(\alpha _k\) is the step size and \(i_k\) is a sample from the distribution over i so that \(\mathbb {E}[\nabla f_{i_k}(x_k)] = \nabla f(x_k)\). Below, we analyze the convergence rate of stochastic gradient methods under standard assumptions on f, and under both a decreasing and a constant step-size scheme.

Theorem 4

Consider problem (7). Assume that each \(f_i\) has an L-Lipschitz continuous gradient (2), f has a non-empty solution set \(\mathcal {X}^*\), f satisfies the PL inequality (3), and \(\mathbb {E}[\Vert \nabla f_i(x_k)\Vert ^2] \le C^2\) for all \(x_k\) and some C. If we use the SG algorithm (9) with \(\alpha _k = \frac{2k+1 }{ 2\mu (k+1)^2}\), then we get a convergence rate of

$$\begin{aligned} \mathbb {E}[f(x_k) - f^*] \le \frac{L C^2 }{ 2 k\mu ^2}. \end{aligned}$$

If instead we use a constant \(\alpha _k = \alpha < \frac{1}{2\mu }\), then we obtain a linear convergence rate up to a solution level that is proportional to \(\alpha \),

$$\begin{aligned} \mathbb {E}[f(x_k) - f^*] \le (1-2\mu \alpha )^k[f(x_0) - f^*] + \frac{LC^2\alpha }{4\mu }. \end{aligned}$$

Proof

By using the update rule (9) inside the Lipschitz condition (2), we have

$$\begin{aligned} f(x_{k+1}) \le f(x_k) - \alpha _k \langle \nabla f(x_k), \nabla f_{i_k}(x_k) \rangle + \frac{L \alpha _k^2 }{ 2} || \nabla f_{i_k}(x_k)||^2. \end{aligned}$$

Taking the expectation of both sides with respect to \(i_k\) we have

$$\begin{aligned} \mathbb {E}[f(x_{k+1})]&\le f(x_k) - \alpha _k \langle \nabla f(x_k), \mathbb {E} \left[ \nabla f_{i_k}(x_k) \right] \rangle + \frac{L \alpha _k^2 }{ 2}\mathbb {E}[\Vert \nabla f_i(x_k)\Vert ^2] \\&\le f(x_k) - \alpha _k || \nabla f(x_k)||^2 + \frac{L C^2 \alpha _k^2 }{ 2} \\&\le f(x_k) - 2\mu \alpha _k ( f(x_k) - f^*) + \frac{L C^2 \alpha _k^2 }{ 2}, \end{aligned}$$

where the second line uses that \(\mathbb {E}[ \nabla f_{i_k}(x_k)] = \nabla f(x_k)\) and \(\mathbb {E}[\Vert \nabla f_i(x_k)\Vert ^2] \le C^2\), and the third line uses the PL inequality. Subtracting \(f^*\) from both sides yields:

$$\begin{aligned} \mathbb {E}[f(x_{k+1}) - f^*] \le (1 - 2\alpha _k \mu )[f(x_{k}) - f^*] + \frac{L C^2 \alpha _k^2 }{ 2}. \end{aligned}$$
(10)

Decreasing step size: With \(\alpha _k = \frac{2k + 1 }{ 2\mu (k+1)^2}\) in (10) we obtain

$$\begin{aligned} \mathbb {E}[f(x_{k+1}) - f^*] \le \frac{k^2 }{ (k+1)^2}[f(x_{k}) - f^*] + \frac{L C^2 (2k+1)^2 }{ 8 \mu ^2 (k+1)^4}. \end{aligned}$$

Multiplying both sides by \((k+1)^2\) and letting \(\delta _f(k) \equiv k^2 \mathbb {E}[f(x_{k}) - f^*]\) we get

$$\begin{aligned} \delta _f(k+1)&\le \delta _f(k) + \frac{L C^2 (2k+1)^2 }{ 8 \mu ^2 (k+1)^2} \nonumber \\&\le \delta _f(k) + \frac{LC^2 }{ 2 \mu ^2}, \end{aligned}$$

where the second line follows from \(\frac{2k+1 }{ k+1 } < 2\). Applying this inequality recursively from iteration 0 up to iteration k and using the fact that \(\delta _f(0) = 0\) we get

$$\begin{aligned}&\delta _f(k+1) \le \delta _f(0) + \frac{L C^2}{ 2 \mu ^2} \sum \nolimits _{i=0}^k 1 \le \frac{L C^2 (k+1)}{ 2 \mu ^2} \nonumber \\ \Rightarrow \quad&(k+1)^2 \mathbb {E}[f(x_{k+1}) - f^*] \le \frac{L C^2 (k+1)}{ 2 \mu ^2} \end{aligned}$$

which gives the stated rate.

Constant step size: Choosing \(\alpha _k = \alpha \) for any \(\alpha < \frac{1}{2\mu }\) and applying (10) recursively yields

$$\begin{aligned} \mathbb {E}[f(x_{k+1}) - f^*]&\le (1 - 2\alpha \mu )^k[f(x_{0}) - f^*] + \frac{L C^2 \alpha ^2 }{ 2} \sum _{i=0}^k (1 - 2\alpha \mu )^i \\&\le (1 - 2\alpha \mu )^k[f(x_{0}) - f^*] + \frac{L C^2 \alpha ^2 }{ 2} \sum _{i=0}^{\infty } (1 - 2\alpha \mu )^i \\&= (1 - 2\alpha \mu )^k[f(x_{0}) - f^*] + \frac{L C^2 \alpha }{ 4 \mu }, \end{aligned}$$

where the last line uses that \(\alpha < \frac{1}{2\mu }\) and the limit of the geometric series.    \(\square \)

The O(1/k) rate for a decreasing step size matches the convergence rate of stochastic gradient methods under SC [27]. It was recently shown using a non-trivial analysis that a stochastic Newton method could achieve an O(1/k) rate for least squares problems [4], but our result above shows that the basic stochastic gradient method already achieves this property (although the constants are worse than for this Newton-like method). Further, our result does not rely on convexity. Note that if we are happy with a solution of fixed accuracy, then the result with a constant step-size is perhaps the more useful strategy in practice: it supports the often-used empirical strategy of using a constant step-size for a long time, then halving the step-size if the algorithm appears to have stalled (the above result indicates that halving the step-size will at least halve the sub-optimality).
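
The sketch below (illustrative only) runs iteration (9) on a small least-squares finite sum with both step-size schemes from Theorem 4. The PL constant \(\mu \) is taken to be the smallest non-zero eigenvalue of \(A^TA/n\) (an assumption in the spirit of Sect. 2.3), and the rows of A are normalized so that the sampled gradients stay small along the trajectory; the bounded-gradient assumption of Theorem 4 does not hold globally for least squares, so this is only a sanity check on a well-conditioned instance.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, r = 500, 5, 2
B = rng.standard_normal((r, d))                      # rows of A lie in a 2-dim subspace,
A = rng.standard_normal((n, r)) @ B                  #   so f is PL but not strongly convex
A /= np.linalg.norm(A, axis=1, keepdims=True)        # unit-norm rows keep sampled gradients small
b = A @ rng.standard_normal(d) + 0.5 * rng.standard_normal(n)   # noisy targets, so f* > 0

f = lambda x: 0.5 * np.mean((A @ x - b) ** 2)        # f(x) = (1/n) sum_i f_i(x)
grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]       # gradient of f_i(x) = 0.5*(a_i^T x - b_i)^2

f_star = f(np.linalg.lstsq(A, b, rcond=None)[0])
eigs = np.linalg.eigvalsh(A.T @ A / n)
mu = eigs[eigs > 1e-10][0]                           # assumed PL constant

def run(step, iters=50000):
    x = np.zeros(d)
    for k in range(iters):
        i = rng.integers(n)
        x = x - step(k) * grad_i(x, i)               # update (9)
    return f(x) - f_star

print(run(lambda k: (2 * k + 1) / (2 * mu * (k + 1) ** 2)))   # decreasing step: gap keeps shrinking
print(run(lambda k: 0.05))                                    # constant step: stalls near a level proportional to alpha
```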

3.4 Finite Sum Methods

In the setting of minimizing finite sums, it has recently been shown that there are methods that have the low iteration cost of stochastic gradient methods but that still have linear convergence rates [33]. While the first methods that achieved this remarkable property required a memory of previous gradient values, the stochastic variance-reduced gradient (SVRG) method of Johnson and Zhang [16] does not have this drawback. In Appendix 3.4, we give a new analysis of the SVRG method that shows that it achieves a linear convergence rate under the PL inequality. Similar results for finite-sum methods under the PL inequality recently appeared in the works of Reddi et al. [36, 37]. Garber and Hazan [10] have also given a related result in the context of an improved algorithm for principal component analysis (PCA), showing that the \(f_i\) do not need to be convex in order to achieve a linear convergence rate. However, their result still assumes that f is SC while our analysis only assumes the PL inequality is satisfied.

4 Proximal-Gradient Generalization

Attouch and Bolte [3] consider a generalization of the PL inequality due to Kurdyka, known as the Kurdyka-Łojasiewicz (KL) inequality, and use it to give conditions under which the classic proximal-point algorithm achieves a linear convergence rate for non-smooth problems. However, in practice proximal-gradient methods are more relevant to many machine learning problems. While the KL inequality has been used to show local linear convergence of proximal-gradient methods [6, 18], in this section we propose a different generalization of the PL inequality that yields a simple global linear convergence analysis.

Proximal-gradient methods apply to problems of the form

$$\begin{aligned} \mathop {\hbox {argmin}}_{x \in \mathbb {R}^d}~\! F(x) = f(x) + g(x), \end{aligned}$$
(11)

where f is a differentiable function with an L-Lipschitz continuous gradient and g is a simple but potentially non-smooth convex function. Typical examples of simple functions g include a scaled \(\ell _1\)-norm of the parameter vector, \(g(x) = \lambda \Vert x\Vert _1\), and indicator functions that are zero if x lies in a simple convex set and are infinity otherwise.

In order to analyze proximal-gradient algorithms, a natural (though not particularly intuitive) generalization of the PL inequality is that there exists a \(\mu > 0\) satisfying

$$\begin{aligned} \frac{1}{2}\mathcal {D}_g(x,L) \ge \mu (F(x) - F^*), \end{aligned}$$
(12)

where

$$\begin{aligned} \mathcal {D}_g(x,\alpha ) \equiv -2\alpha \min _y [ \langle \nabla f(x) , y-x \rangle + \frac{\alpha }{2}||y-x||^2 + g(y) - g(x) ]. \end{aligned}$$
(13)

We call this the proximal-PL inequality, and we note that if g is constant (or linear) then it reduces to the standard PL inequality. Below we show that this inequality is sufficient for the proximal-gradient method to achieve a global linear convergence rate.

Theorem 5

Consider problem (11), where f has an L-Lipschitz continuous gradient (2), F has a non-empty solution set \(\mathcal {X}^*\), g is convex, and F satisfies the proximal-PL inequality (12). Then the proximal-gradient method with a step-size of 1/L,

$$\begin{aligned} x_{k+1} = \mathop {\hbox {argmin}}_{y}~\! [ \langle \nabla f(x_k) , y-x_k \rangle + \frac{L }{ 2}||y-x_k||^2 + g(y) - g(x_k) ] \end{aligned}$$
(14)

converges linearly to the optimal value \(F^*\),

$$\begin{aligned} F(x_{k}) - F^* \le \left( 1 - \frac{\mu }{ L}\right) ^k[F(x_{0}) - F^*]. \end{aligned}$$

Proof

By using Lipschitz continuity of the function f we have

$$\begin{aligned} F(x_{k+1})&= f(x_{k+1}) + g(x_k) + g(x_{k+1}) - g(x_k) \nonumber \\&\le F(x_k) + \langle \nabla f(x_k) , x_{k+1}-x_k \rangle + \frac{L }{ 2}||x_{k+1}-x_k||^2 + g(x_{k+1}) - g(x_k) \nonumber \\&\le F(x_k) - \frac{1 }{ 2L} \mathcal {D}_g(x_k,L)\\&\le F(x_k) - \frac{\mu }{L}[F(x_k) - F^*], \end{aligned}$$

which uses the definition of \(x_{k+1}\) and \(\mathcal {D}_g\) followed by the proximal-PL inequality (12). This subsequently implies that

$$\begin{aligned} F(x_{k+1}) - F^* \le \left( 1 - \frac{\mu }{ L}\right) [ F(x_{k}) - F^*], \end{aligned}$$
(15)

which applied recursively gives the result.    \(\square \)

We note that the condition \(\mu \le L\) is implicit in the definition of the proximal-PL inequality, but this is not restrictive since we can simply set \(\mu \) to a smaller value to satisfy it. While other conditions have been proposed to show linear convergence rates of proximal-gradient methods without SC [17, 44], their analyses tend to be much more complicated than the above and, as we discuss in the next section, the proximal-PL inequality covers the standard scenarios where these conditions apply.
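
For concreteness, the sketch below (not part of the analysis) applies update (14) to an \(\ell _1\)-regularized least squares problem, anticipating Sect. 4.2. For \(g(x) = \lambda \Vert x\Vert _1\), the minimization in (14) has the well-known closed form given by the soft-thresholding operator, which is the fact assumed here.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, r = 50, 20, 5
A = rng.standard_normal((n, r)) @ rng.standard_normal((r, d))   # rank-deficient, so f is not SC
b = A @ rng.standard_normal(d)
lam = 0.1

F = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2 + lam * np.abs(x).sum()
L = np.linalg.norm(A, 2) ** 2                        # Lipschitz constant of the smooth part

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

x = np.zeros(d)
vals = [F(x)]
for _ in range(3000):
    x = soft_threshold(x - A.T @ (A @ x - b) / L, lam / L)   # closed form of update (14)
    vals.append(F(x))

gaps = np.array(vals) - vals[-1]                     # proxy for F(x_k) - F*
print(gaps[[0, 100, 200, 400]])                      # roughly geometric decrease
```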

4.1 Relevant Problems

As with the PL inequality, we now list several important function classes that satisfy the proximal-PL inequality (12). We give proofs that these classes satisfy the inequality in Appendices 4.1, 4.2, and 4.4.

1. The inequality is satisfied if f satisfies the PL inequality and g is constant. Thus, the above result generalizes Theorem 1.

2. The inequality is satisfied if f is SC. This is the usual assumption used to show a linear convergence rate for the proximal-gradient algorithm [34], although we note that the above analysis is much simpler than standard arguments.

3. The inequality is satisfied if f has the form \(f(x) = h(Ax)\) for a SC function h and a matrix A, while g is an indicator function for a polyhedral set.

4. The inequality is satisfied if F is convex and satisfies the QG property. In Appendices 4.2 and 4.4 we show that L1-regularized least squares and the support vector machine dual (respectively) fall into this category, and we discuss these two notable cases further below.

We expect that it is possible to show the proximal-PL inequality holds in other cases where the proximal-gradient method achieves a linear convergence rate, such as group L1-regularization [40] and nuclear-norm regularization [14].

4.2 Least Squares with L1-Regularization

Perhaps the most interesting example of problem (11) is the \(\ell _1\)-regularized least squares problem,

$$\begin{aligned} \mathop {\hbox {argmin}}_{x \in \mathbb {R}^d}~\! \frac{1}{2}\Vert Ax - b\Vert ^2 + \lambda \Vert x\Vert _1, \end{aligned}$$

where \(\lambda > 0\) is the regularization parameter. This problem has been studied extensively in machine learning, signal processing, and statistics, and its structure seems well-suited to proximal-gradient methods. However, the first works analyzing proximal-gradient methods for this problem only showed sub-linear convergence rates, and a variety of subsequent works showed that linear convergence rates can be achieved under additional assumptions. For example, Gu et al. [11] prove that their algorithm achieves a linear convergence rate if A satisfies a restricted isometry property (RIP) and the solution is sufficiently sparse. Xiao and Zhang [43] also assume the RIP property and show linear convergence using a homotopy method that slowly decreases the value of \(\lambda \). Agarwal et al. [1] give a linear convergence rate under a modified restricted strong convexity and modified restricted smoothness assumption. In Appendix 4.2 we show that any L1-regularized least squares problem satisfies the QG property if we use a descent method, and thus by convexity also satisfies the proximal-PL inequality. Thus, Theorem 5 implies a global linear convergence rate for these problems without additional assumptions or modifications to the algorithm. A similar result recently appeared in the work of Necoara and Clipici [26] under a generalized EB, but with a much more complicated analysis.

4.3 Proximal Coordinate Descent

It is also possible to adapt our results on coordinate descent and proximal-gradient methods in order to give a linear convergence rate for coordinate-wise proximal-gradient methods for problem (11). To do this, we require the extra assumption that g is a separable function. This means that \(g(x) = \sum _i g_i(x_i)\) for a set of univariate functions \(g_i\). The update rule for the coordinate-wise proximal-gradient method is

$$\begin{aligned} x_{k+1} = x_k + \alpha _k e_{i_k}, \quad \text {where} \quad \alpha _k = \mathop {\hbox {argmin}}_{\alpha }~\! \left[ \alpha \nabla _{i_k} f(x_k) + \frac{L }{ 2}\alpha ^2 + g_{i_k}((x_k)_{i_k} + \alpha ) - g_{i_k}((x_k)_{i_k}) \right] . \end{aligned}$$
(16)

We state the convergence rate result below.

Theorem 6

Assume the setup of Theorem 5 and that g is a separable function \(g(x) = \sum _i g_i(x_i)\), where each \(g_i\) is convex. Then the coordinate-wise proximal-gradient update rule (16) achieves a convergence rate

$$\begin{aligned} \mathbb {E} [F(x_{k}) - F^*] \le \left( 1 - \frac{\mu }{ dL}\right) ^k[F(x_0) - F^*], \end{aligned}$$
(17)

when \(i_k\) is selected uniformly at random.

The proof is given in Appendix 4.3 and although it is more complicated than the proofs of Theorems 3 and 5, it is still simpler than existing proofs for proximal coordinate descent under SC [39]. It is also possible to analyze stochastic proximal-gradient algorithms, and indeed Reddi et al. use the proximal-PL inequality to analyze finite-sum methods in the proximal stochastic case [38].
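
A minimal sketch of update (16) for the separable choice \(g(x) = \lambda \Vert x\Vert _1\) is given below (illustrative only); the one-dimensional minimization again reduces to soft-thresholding the selected coordinate, which is the closed form assumed here.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, r = 50, 20, 5
A = rng.standard_normal((n, r)) @ rng.standard_normal((r, d))
b = A @ rng.standard_normal(d)
lam = 0.1

F = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2 + lam * np.abs(x).sum()
L = np.sum(A ** 2, axis=0).max()                     # coordinate-wise Lipschitz constant (5)
soft = lambda z, t: np.sign(z) * max(abs(z) - t, 0.0)

x = np.zeros(d)
for k in range(20000):
    i = rng.integers(d)                              # uniform coordinate selection
    g_i = A[:, i] @ (A @ x - b)                      # i-th partial derivative of the smooth part
    x[i] = soft(x[i] - g_i / L, lam / L)             # closed form of update (16) for g_i = lam*|.|
print(F(x))
```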

4.4 Support Vector Machines

Another important model problem that arises in machine learning is the support vector machine (SVM),

$$\begin{aligned} \mathop {\hbox {argmin}}_{x\in \mathbb {R}^d}~\! \frac{\lambda }{ 2} x^T x + \sum _{i=1}^n \max (0, 1-b_i x^Ta_i), \end{aligned}$$
(18)

where the pairs \((a_i, b_i)\) form the labelled training set, with \(a_i \in \mathbb {R}^d\) and \(b_i \in \{-1, 1\}\). We often solve this problem by performing coordinate optimization on its dual, which has the form

$$\begin{aligned} \min _{\bar{w}} f(\bar{w}) = \frac{1 }{ 2} \bar{w}^T M \bar{w} - \sum _{i=1}^n \bar{w}_i, \quad \bar{w}_i \in [0,U] \text { for all } i, \end{aligned}$$
(19)

for a particular matrix M and constant U. This function satisfies the QG property and thus Theorem 6 implies that coordinate optimization achieves a linear convergence rate in terms of optimizing the dual objective. Further, since Hush et al. [15] show that we can obtain an \(\epsilon \)-accurate solution to the primal problem with an \(O(\epsilon ^2)\)-accurate solution to the dual problem, this also implies a linear convergence rate for stochastic dual coordinate ascent on the primal problem. Global linear convergence rates for SVMs have also been shown by others [23, 41, 42], but again we note that these works lead to much more complicated analyses. Although the constants in these convergence rates may be quite bad (depending on the smallest non-zero singular value of the Gram matrix), we note that the existing sublinear rates still apply in the early iterations while, as the algorithm begins to identify support vectors, the constants improve (they then depend on the smallest non-zero singular value of the block of the Gram matrix corresponding to the support vectors).
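
The sketch below (illustrative only) performs exact coordinate optimization on a box-constrained quadratic of the form (19). The particular choices \(M_{ij} = b_ib_ja_i^Ta_j/\lambda \) and \(U = 1\), and the primal recovery \(x = \frac{1}{\lambda }\sum _i \bar{w}_ib_ia_i\), correspond to one common dualization of (18) and are assumptions of the sketch rather than derivations from the text.

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 100, 5
A = rng.standard_normal((n, d))
b = np.sign(rng.standard_normal(n))
lam, U = 1.0, 1.0

M = (b[:, None] * A) @ (b[:, None] * A).T / lam      # assumed M_{ij} = b_i b_j a_i^T a_j / lam

dual = lambda w: 0.5 * w @ M @ w - w.sum()           # objective (19)

def primal(w):
    x = A.T @ (w * b) / lam                          # assumed primal recovery from the dual variables
    return 0.5 * lam * x @ x + np.maximum(0.0, 1.0 - b * (A @ x)).sum()

w = np.zeros(n)
for k in range(50 * n):
    i = rng.integers(n)
    step = (1.0 - M[i] @ w) / M[i, i]                # exact minimization over coordinate i
    w[i] = np.clip(w[i] + step, 0.0, U)              # project back onto the box constraint
print(primal(w) + dual(w))                           # duality gap; shrinks toward zero
```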

This argument is not restricted to SVMs. Indeed, the result of the previous section implies a linear convergence rate for many \(\ell _2\)-regularized linear prediction problems, the framework considered in the stochastic dual coordinate ascent (SDCA) work of Shalev-Shwartz and Zhang [35]. While Shalev-Shwartz and Zhang [35] show that this is true when the primal is smooth, our result gives linear rates in many cases where the primal is non-smooth.

5 Discussion

We believe that this work provides a unifying and simplifying view of a variety of optimization and convergence rate issues in machine learning. Indeed, we have shown that many of the assumptions used to achieve linear convergence rates can be replaced by the PL inequality and its proximal generalization. Throughout the paper, we have also pointed out how our analysis implies new convergence rates for a variety of machine learning models and algorithms. Some of these were previously known, typically under stronger assumptions or with more complicated proofs, but many of them are novel. Note that we have not provided any experimental results in this work, since the main contribution is to show that existing algorithms work better on standard problems than was previously known. We expect that going forward, efficiency will no longer be decided by the issue of whether functions are SC, but rather by whether they satisfy a variant of the PL inequality.