Abstract
In our previous paper Bello-Cruz et al. (J Optim Theory Appl 188:378–401, 2021), we showed that the quadratic growth condition plays a key role in obtaining Q-linear convergence of the widely used forward–backward splitting method with Beck–Teboulle’s line search. In this paper, we analyze the quadratic growth condition via second-order variational analysis for various structured optimization problems arising in machine learning and signal processing, including, for example, the Poisson linear inverse problem and \(\ell _1\)-regularized optimization problems. As a by-product of this approach, we also obtain several full characterizations of the uniqueness of the optimal solution to the Lasso problem, which complement and extend recent important results in this direction.
1 Introduction
This paper is a continuation of our previous work [8], in which we studied convergence properties of the forward–backward splitting method (FBS for short, also known as the proximal gradient method). The FBS [5, 7, 12, 14, 15, 28, 37] is a simple and efficient method for solving an optimization problem whose objective function is the sum of two convex functions: one differentiable on its domain, and the other proximal-friendly (that is, its proximal mapping can be easily computed) and possibly non-differentiable. It is well known that FBS is globally convergent to an optimal solution with complexity \(o(k^{-1})\) [7, 9, 19, 40] in general settings. Linear convergence of FBS has been studied in many papers via the Kurdyka–Łojasiewicz inequality [10, 25, 26, 31, 49] or error bound conditions [21, 38, 41, 53], building on [34]. Without assuming the usual condition that the gradient of the differentiable function involved is globally Lipschitz continuous, our previous paper [8] studied convergence properties and the complexity of the FBS method with Beck–Teboulle’s line search. In particular, under the so-called quadratic growth condition, also known as the 2-conditioned property, which is close to the idea in [21, 25, 26], we showed that the sequence generated by FBS with Beck–Teboulle’s line search is Q-linearly convergent. Our derived linear rates complement and sometimes improve those in [21, 25, 26].
One of the main aims of this paper is to analyze the quadratic growth condition for several structured optimization problems. This allows us to understand the performance of FBS methods on specific optimization problems by exploiting their particular structure. In particular, we show that the quadratic growth condition is automatically satisfied for the standard Poisson inverse regularized problems with Kullback–Leibler divergence [16, 47], which do not satisfy the usual global Lipschitz continuity assumption mentioned above. Using FBS to solve Poisson inverse regularized problems was first proposed in [4] via the idea of Bregman divergence. Recently, Salzo [40] proved that the FBS method with an appropriate line search enjoys a complexity of \(o(k^{-1})\) when applied to Poisson inverse regularized problems. In this paper, we advance this direction by showing that the convergence rate of the FBS method with Beck–Teboulle’s line search is indeed Q-linear when solving Poisson inverse regularized problems.
It is worth noting that linear convergence of the sequence generated by FBS for some structured optimization problems was also studied in [12, 28, 32, 33, 41] when the nonsmooth function is partly smooth relative to a manifold, using the idea of finite support identification. The latter notion, introduced by Lewis [29], allows Liang, Fadili, and Peyré [32, 33] to cover many important problems such as the total variation semi-norm, the \(\ell _1\)-norm, the \(\ell _\infty \)-norm, and the nuclear norm problems. In their paper, a second-order condition was introduced to guarantee the local Q-linear convergence of the FBS sequence under the non-degeneracy assumption [29]. When considering the \(\ell _1\)-regularized problem, we are able to avoid the non-degeneracy assumption. Under the setting of [8], this allows us to improve the well-known work of Hale, Yin, and Zhang [28] in two aspects: (a) We completely drop the aforementioned non-degeneracy assumption. (b) Our second-order condition is strictly weaker than the one in [28, Theorem 4.10]. The wider view is that, when considering particular optimization problems in the spirit of [32, 33], the non-degeneracy assumption may not be necessary. Furthermore, we revisit the iterative shrinkage thresholding algorithm (ISTA) [7, 18], which is FBS applied to the Lasso problem [42]. It is well known that the complexity of this algorithm is \(\mathcal {O}(k^{-1})\); however, recent works [32, 41] indicate the local linear convergence of ISTA. The strongest conclusion in this direction was obtained recently by Bolte, Nguyen, Peypouquet, and Suter [10, 25]: the iterative sequence of ISTA is R-linearly convergent, and its corresponding cost sequence is globally Q-linearly convergent, but the rate may depend on the initial point. Inspired by these achievements, we provide two new results under the setting of [8]: (c) The iterative sequence of ISTA is indeed globally Q-linearly convergent.
(d) The iterative sequence of ISTA is eventually Q-linearly convergent to an optimal solution with a uniform rate that does not depend on the initial point.
In order to obtain the linear convergence of ISTA, several papers make the assumption that the optimal solution to Lasso is unique; see, e.g., [12, 24, 28, 41]. Although solution uniqueness is not necessary, as discussed above, it is an important property with immediate implications for recovering sparse signals in compressed sensing; see, e.g., [13, 23, 24, 27, 36, 43, 44, 48, 50, 51] and the references therein. As a direct consequence of our analysis of the \(\ell _1\)-regularized problem, we fully characterize solution uniqueness for the Lasso problem. To the best of our knowledge, Fuchs [23] initiated this direction by introducing a simple sufficient condition for this property, which has been extended in other cited papers. Then, in [43], Tibshirani showed that a sufficient condition closely related to Fuchs’ condition is also necessary almost everywhere. A full characterization of this property has been obtained recently in [50, 51] by using strong duality results from linear programming. This characterization, which is based on the existence of a vector satisfying a system of linear equations and inequalities, allows [50, 51] to recover the aforementioned sufficient conditions and identify situations in which these conditions become necessary. Some related results have been developed in [27, 36]. Our approach to solution uniqueness is new and different. We also derive several new full characterizations in terms of positive linear independence and Slater-type conditions, which can be easily verified.
The outline of our paper is as follows. Section 2 briefly presents a second-order characterization of the quadratic growth condition in terms of the subgradient graphical derivative [39] and recalls some convergence analysis from our part I [8]. Section 3 is devoted to the study of the quadratic growth condition for some structured optimization problems, namely Poisson inverse regularized, \(\ell _1\)-regularized, and \(\ell _1\)-regularized least squares optimization problems. In Sect. 4, we obtain several new full characterizations of the uniqueness of the optimal solution to the Lasso problem. Section 5 gives some conclusions and potential future work in this direction.
2 Preliminary Results on Metric Subregularity of the Subdifferential and Quadratic Growth Condition
Throughout the paper, \(\mathbb {R}^n\) is the usual Euclidean space of dimension n, where \(\Vert \cdot \Vert \) and \(\langle \cdot , \cdot \rangle \) denote the corresponding Euclidean norm and inner product in \(\mathbb {R}^n\). We use \(\Gamma _0(\mathbb {R}^n)\) to denote the set of proper, lower semicontinuous, and convex functions on \(\mathbb {R}^n\). For \(h\in \Gamma _0(\mathbb {R}^n)\), we write \(\mathrm{dom\,}h:=\{x\in \mathbb {R}^n\,|\; h(x)<+\infty \}\). The subdifferential of h at \({{\bar{x}}}\in \mathrm{dom\,}h\) is defined by
$$\begin{aligned} \partial h({{\bar{x}}}):=\{v\in \mathbb {R}^n\,|\; \langle v,x-{{\bar{x}}}\rangle \le h(x)-h({{\bar{x}}}) \text{ for } \text{ all } x\in \mathbb {R}^n\}. \end{aligned}$$(1)
We say h satisfies the quadratic growth condition at \({{\bar{x}}}\) with modulus \(\kappa >0\) if there exists \(\varepsilon >0\) such that
$$\begin{aligned} h(x)\ge h({{\bar{x}}})+\frac{\kappa }{2}\, d^2\big (x;(\partial h)^{-1}(0)\big )\quad \text{ for } \text{ all } x\in \mathbb {B}_\varepsilon ({{\bar{x}}}). \end{aligned}$$(2)
Here, for a set S, d(x; S) denotes the distance from x to S, and \(\mathbb {B}_{\varepsilon }(\bar{x})\) denotes the ball centered at \(\bar{x}\) with radius \(\varepsilon \). Moreover, if (2) and \((\partial h)^{-1}(0)=\{{{\bar{x}}}\}\) are both satisfied, then we say that the strong quadratic growth condition holds for h at \({{\bar{x}}}\) with modulus \(\kappa \).
Relationships between the quadratic growth condition and the so-called metric subregularity of the subdifferential can be found in [1,2,3, 10, 22], even for nonconvex functions. The quadratic growth condition (2) is also called the quadratic functional growth property in [38] when h is continuously differentiable over a closed convex set. In [25, 26], h is said to be 2-conditioned on \(\mathbb {B}_\varepsilon ({{\bar{x}}})\) if it satisfies the quadratic growth condition (2).
The following proposition, a slight improvement of [2, Corollary 3.7], provides a useful characterization of the strong quadratic growth condition via the subgradient graphical derivative [39, Chapter 13].
Proposition 2.1
(Characterization of strong quadratic growth condition) Let \(h\in \Gamma _0(\mathbb {R}^n)\) and \({{\bar{x}}}\) be an optimal solution, i.e., \(0\in \partial h({{\bar{x}}})\). The following are equivalent:
- (i) h satisfies the strong quadratic growth condition at \({{\bar{x}}}\).
- (ii) \(D(\partial h)({{\bar{x}}}|0)\) is positive-definite in the sense that
$$\begin{aligned} \langle v,u\rangle >0\quad \text{ for } \text{ all }\quad v\in D(\partial h)({{\bar{x}}}|0)(u), u\in \mathbb {R}^n, u\ne 0, \end{aligned}$$(3)
where \(D(\partial h)({{\bar{x}}}|0):\mathbb {R}^n \rightrightarrows \mathbb {R}^n\) is the subgradient graphical derivative of \(\partial h\) at \({{\bar{x}}}\) for 0 defined by
$$\begin{aligned}&D(\partial h)({{\bar{x}}}|0)(u):=\{v\in \mathbb {R}^n|\; \exists (u_n,v_n)\rightarrow (u,v), t_n\downarrow 0 \\&\quad \text{ such } \text{ that } t_nv_n\in \partial h({{\bar{x}}}+t_nu_n)\} \text{ for } \text{ any } u\in \mathbb {R}^n. \end{aligned}$$
Moreover, if (ii) is satisfied, then
$$\begin{aligned} \ell :=\inf \left\{ \frac{\langle v,u\rangle }{\Vert u\Vert ^2}\,\Big |\; v\in D(\partial h)({{\bar{x}}}|0)(u),\ u\in \mathbb {R}^n\right\} >0 \end{aligned}$$(4)
with the convention \(\dfrac{0}{0}=\infty \), and h satisfies the strong quadratic growth condition at \({{\bar{x}}}\) with any modulus \(\kappa <\ell \).
Proof
The implication [(i)\(\Rightarrow \) (ii)] follows from [2, Theorem 3.6 and Corollary 3.7]. If (ii) is satisfied, we obtain from (3) that \(\Vert v\Vert \ge \ell \Vert u\Vert \). Combining [20, Theorem 4C.1] and [22, Corollary 3.3] tells us that h satisfies the strong quadratic growth condition at \({{\bar{x}}}\) with any modulus \(\kappa <\ell \). The proof is complete. \(\square \)
Next, let us recall some main results from our part I [8] regarding the convergence of the forward–backward splitting method (FBS) for solving the following optimization problem:
$$\begin{aligned} \min _{x\in \mathbb {R}^n}\; F(x):=f(x)+g(x), \end{aligned}$$(5)
where \(f,g:\mathbb {R}^n\rightarrow \mathbb {R}\cup \{\infty \}\) are proper, lower semicontinuous, and convex functions. The standing assumptions on the initial data for (5), used throughout the paper, are:
- A1: \(f, g\in \Gamma _0(\mathbb {R}^n)\) and \(\mathrm{int}(\mathrm{dom\,}f)\cap \mathrm{dom\,}g\ne \emptyset \).
- A2: f is continuously differentiable at any point in \(\mathrm{int}(\mathrm{dom\,}f)\cap \mathrm{dom\,}g\).
- A3: For any \(x\in \mathrm{int}(\mathrm{dom\,}f)\cap \mathrm{dom\,}g\), the sublevel set \(\{F\le F(x)\}\) is contained in \(\mathrm{int}(\mathrm{dom\,}f)\cap \mathrm{dom\,}g\).
The forward–backward splitting method for solving (5) is described by
$$\begin{aligned} x^{k+1}:=\mathrm{prox}_{\alpha _k g}\big (x^k-\alpha _k\nabla f(x^k)\big ),\quad k\in \mathbb {N}, \end{aligned}$$(6)
with the proximal operator \(\mathrm{prox}_{g}:\mathbb {R}^n\rightarrow \mathrm{dom\,}g\) given by
$$\begin{aligned} \mathrm{prox}_g(x):=\mathop {\mathrm{argmin}}\limits _{u\in \mathbb {R}^n}\left\{ g(u)+\frac{1}{2}\Vert u-x\Vert ^2\right\} , \end{aligned}$$(7)
and the stepsize \(\alpha _k>0\) determined from Beck–Teboulle’s line search as follows: given \(\sigma >0\) and \(\theta \in (0,1)\), set \(\alpha _k:=\sigma \theta ^{m_k}\), where \(m_k\) is the smallest nonnegative integer m such that, with \(\alpha =\sigma \theta ^{m}\) and \(x^+:=\mathrm{prox}_{\alpha g}(x^k-\alpha \nabla f(x^k))\),
$$\begin{aligned} f(x^+)\le f(x^k)+\langle \nabla f(x^k),x^+-x^k\rangle +\frac{1}{2\alpha }\Vert x^+-x^k\Vert ^2. \end{aligned}$$
In [8, Proposition 3.1 and Corollary 3.1], we show that the line search above terminates after finitely many steps, the FBS sequence \((x^k)_{k\in \mathbb {N}}\subset \mathrm{int}\, (\mathrm{dom\,}f)\cap \mathrm{dom\,}g\) is well defined, and thus f is differentiable at each \(x^k\) by assumption A2. The global convergence result [8, Theorem 3.1] is recalled here.
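To make the scheme concrete, here is a minimal Python sketch of FBS with a Beck–Teboulle-style backtracking line search. It is an illustration under stated assumptions, not a quotation of the method in [8]: the callables `f`, `grad_f`, `prox_g` and the parameters `sigma`, `theta` are placeholders supplied by the user.

```python
import numpy as np

def fbs_linesearch(grad_f, f, prox_g, x0, sigma=1.0, theta=0.5, iters=200):
    """Forward-backward splitting with a backtracking line search (sketch).

    At each iteration the trial stepsize alpha is shrunk by theta until a
    sufficient-decrease test of Beck-Teboulle type holds.  prox_g(z, a)
    should return the proximal point of a*g at z; f may return np.inf
    outside its domain, which simply forces a smaller stepsize.
    """
    x = x0.astype(float)
    for _ in range(iters):
        g = grad_f(x)
        alpha = sigma
        while True:
            x_new = prox_g(x - alpha * g, alpha)
            d = x_new - x
            # sufficient-decrease test on the smooth part f
            if f(x_new) <= f(x) + g @ d + (d @ d) / (2.0 * alpha):
                break
            alpha *= theta
        x = x_new
    return x
```

For instance, with \(g=0\) (so the prox is the identity) and a simple quadratic f, the iteration reduces to gradient descent with backtracking and converges to the minimizer of f.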
Theorem 2.1
(Global convergence of FBS method) Let \((x^k)_{k\in \mathbb {N}}\) be the sequence generated by the FBS method. Suppose that the solution set is nonempty. Then, \((x^k)_{k\in \mathbb {N}}\) converges to an optimal solution. Moreover, \( (F(x^k))_{k\in \mathbb {N}}\) converges to the optimal value.
When the cost function F satisfies the quadratic growth condition and \(\nabla f\) is locally Lipschitz continuous, our [8, Theorem 4.1] shows that both iterative and cost sequences of FBS are Q-linearly convergent.
Theorem 2.2
(Q-linear convergence under quadratic growth condition) Let \((x^k)_{k\in \mathbb {N}}\) be the sequence generated by the FBS method. Suppose that the optimal solution set \( S^*\) of problem (5) is nonempty, and let \(x^*\in S^*\) be the limit point of \((x^k)_{k\in \mathbb {N}}\). Suppose further that \(\nabla f\) is locally Lipschitz continuous around \(x^*\) with constant \(L>0\). If F satisfies the quadratic growth condition at \(x^*\) with modulus \(\kappa >0\), then there exists \(K\in \mathbb {N}\) such that
for any \(k>K\), where \(\alpha :=\min \big \{\alpha _K,\frac{\theta }{L}\big \}\).
If, in addition, \(\nabla f\) is globally Lipschitz continuous on \(\mathrm{int}(\mathrm{dom\,}f)\cap \mathrm{dom\,}g\) with constant \(L>0\), then \(\alpha \) can be chosen as \(\min \big \{\sigma ,\frac{\theta }{L}\big \}\).
Under the strong quadratic growth condition, a sharper rate is obtained in [8, Corollary 4.1].
Corollary 2.1
(Sharper Q-linear convergence rate under strong quadratic growth condition) Let \((x^k)_{k\in \mathbb {N}}\) be the sequence generated by the FBS method. Suppose that the solution set \(S^*\) is nonempty, and let \(x^*\in S^*\) be the limit point of \((x^k)_{k\in \mathbb {N}}\) as in Theorem 2.1. Suppose further that \(\nabla f\) is locally Lipschitz continuous around \(x^*\) with constant \(L>0\). If F satisfies the strong quadratic growth condition at \(x^*\) with modulus \(\kappa >0\), then there exists some \(K\in \mathbb {N}\) such that for any \(k>K\) we have
Additionally, if \(\nabla f\) is globally Lipschitz continuous on \(\mathrm{int}(\mathrm{dom\,}f)\cap \mathrm{dom\,}g\) with constant \(L>0\), then \(\alpha \) above can be chosen as \(\min \big \{\sigma ,\frac{\theta }{L}\big \}\).
3 Quadratic Growth Conditions and Linear Convergence of Forward–Backward Splitting Method in Some Structured Optimization Problems
In this section, we mainly show that the quadratic growth condition holds automatically, or can be fulfilled under mild assumptions, for several important classes of convex optimization problems.
3.1 Poisson Linear Inverse Problem
This subsection is devoted to the eventual linear convergence of FBS when solving the following standard Poisson regularized problem [16, 47]
where \(A\in \mathbb {R}^{m\times n}_+\) is an \(m\times n\) matrix with nonnegative entries and nontrivial rows, and \(b\in \mathbb {R}^m_{++}\) is a positive vector. This problem is typically used to recover a signal \(x\in \mathbb {R}^n_+\) from a measurement b corrupted by Poisson noise, with \(Ax\simeq b\). Problem (10) can be written in the form (5) with
where h is the Kullback–Leibler divergence defined by
Note from (11) and (12) that \(\mathrm{dom\,}f=A^{-1}(\mathbb {R}^m_{++})\), which is an open set. Moreover, since \(A\in \mathbb {R}^{m\times n}_+\), we have \(\mathrm{dom\,}f\cap \mathrm{dom\,}g=A^{-1}(\mathbb {R}^m_{++})\cap \mathbb {R}^n_+\ne \emptyset \) and f is continuously differentiable at any point on \( \mathrm{dom\,}f\cap \mathrm{dom\,}g\). The standing assumptions A1 and A2 are satisfied for Problem (10). Moreover, since the function \(F_1\) is bounded below and coercive, the optimal solution set of problem (10) is always nonempty.
It is worth noting further that \(\nabla f\) is locally Lipschitz continuous at any point of \(\mathrm{int}(\mathrm{dom\,}f)\cap \mathrm{dom\,}g\) but not globally Lipschitz continuous on \(\mathrm{int}(\mathrm{dom\,}f)\cap \mathrm{dom\,}g\). [40, Section 4] and our [8, Theorem 3.2] show that FBS is applicable to (10) with global convergence rate \(o(\frac{1}{k})\). In the recent work [4], a new algorithm, a variant of FBS, was designed with applications to solving (10). However, the theory developed in [4] cannot guarantee the global convergence of the sequence \((x^k)_{k\in \mathbb {N}}\) generated by that algorithm when solving (10), because the closedness assumptions on the domain of the auxiliary Legendre function in [4, Theorem 2] are not satisfied for (10). Our intent here is to establish the Q-linear convergence of our method when solving (10) in the sense of Theorem 2.2. To do so, we need to verify the quadratic growth condition for \( F_1\) at any optimal solution. Note further that the Kullback–Leibler divergence h is not strongly convex and \(\nabla f\) is not globally Lipschitz continuous; hence, the standing assumptions in [21] are not satisfied, and proving the quadratic growth condition for \( F_1\) at an optimal solution via the approach of [21] requires caution.
Lemma 3.1
Let \({{\bar{x}}}\) be an optimal solution to problem (10) and let \(S^*\) denote the optimal solution set of (10). Then, for any \(R>0\), we have
$$\begin{aligned} F_1(x)\ge F_1({{\bar{x}}})+\nu \, d^2(x;S^*)\quad \text{ for } \text{ all } x\in \mathbb {B}_R({{\bar{x}}}) \end{aligned}$$(13)
with some constant \(\nu >0\).
Proof
Pick any \(R>0\) and \(x\in \mathbb {B}_R({{\bar{x}}})\). We only need to prove (13) for the case \(x\in \mathrm{dom\,}F_1\cap \mathbb {B}_R({{\bar{x}}})\), i.e., \(x\in A^{-1}(\mathbb {R}^m_{++})\cap \mathbb {R}^n_+\cap \mathbb {B}_R({{\bar{x}}})\). Note that
where \(a_i\) is the ith row of A. Define \({{\bar{y}}}:=A{{\bar{x}}}\). For any \(x,u\in \mathbb {B}_R({{\bar{x}}})\cap \mathrm{dom\,}f\), we have \([x,u]\subset \mathbb {B}_R({{\bar{x}}})\cap \mathrm{dom\,}f\) and obtain from the mean-value theorem that
Similarly, we have
Adding the above two inequalities gives us that
We claim that the optimal solution set \(S^*\) to problem (10) satisfies that
Pick any other optimal solution \({{\bar{u}}}\in S^*\). By the convexity of \(S^*\), we have \({{\bar{u}}}_t:={{\bar{x}}}+t({{\bar{u}}}-{{\bar{x}}})\in S^*\subset \mathrm{dom\,}f\) for any \(t\in [0,1]\). By choosing t sufficiently small, we have \({{\bar{u}}}_t\in \mathbb {B}_R({{\bar{x}}})\cap \mathrm{dom\,}f\). Note further that \(-\nabla f({{\bar{u}}}_t)\in \partial g({{\bar{u}}}_t)\) and \(-\nabla f({{\bar{x}}})\in \partial g({{\bar{x}}})\). Since \(\partial g\) is a monotone operator, we obtain that
This together with (15) tells us that \(\langle a_i, {{\bar{x}}}-{{\bar{u}}}_t\rangle =0\) for all \(i=1,\ldots ,m\). Hence, \(A{{\bar{x}}}=A{{\bar{u}}}={{\bar{y}}}\) for any \({{\bar{u}}}\in S^*\), which also implies that
This verifies the inclusion “\(\subset \)” in (16). The opposite inclusion is trivial. Indeed, take any u satisfying \(Au={{\bar{y}}}\) and \(-\nabla f({{\bar{x}}})\in \partial g(u)\); similarly to (17), we have \(-\nabla f(u)=-\nabla f({{\bar{x}}})\in \partial g(u)\). This shows that \(0\in \nabla f(u)+\partial g(u)\), i.e., \(u\in S^*\). This completes the proof of equality (16).
Note from (16) that the optimal solution set \(S^*\) is a polyhedron of the form
due to the fact that \((\partial g)^{-1}(-\nabla f({{\bar{x}}}))=\{u\in \mathbb {R}^n_+\,|\; \langle \nabla f({{\bar{x}}}),u\rangle =0= \langle \nabla f({{\bar{x}}}),{{\bar{x}}}\rangle \}. \) Thanks to Hoffman’s lemma, there exists a constant \(\gamma >0\) such that
Fix any \(x\in \mathbb {B}_R({{\bar{x}}})\cap \mathbb {R}^n_+\), (14) tells us that
where \(\alpha = \min \limits _{1\le i\le m}\Big [\frac{b_i}{[|\langle a_i,{{\bar{x}}}\rangle |+3\Vert a_i\Vert R]^2}\Big ]\). Since \(- \nabla f({{\bar{x}}})\in \partial g({{\bar{x}}})\), we have \(\langle \nabla f({{\bar{x}}}),x-{{\bar{x}}}\rangle \ge 0\). This together with (19) implies that
where the fourth inequality follows from the elementary inequality that \(\frac{(a+b)^2}{2} \le a^2+b^2\) with \(a,b \ge 0\), and the last inequality is from (18). This clearly ensures (13). \(\square \)
When applying FBS to solve problem (10), we have
$$\begin{aligned} x^{k+1}=\mathbb {P}_{\mathbb {R}^n_+}\big (x^k-\alpha _k\nabla f(x^k)\big ), \end{aligned}$$(20)
where \(\mathbb {P}_{\mathbb {R}^n_+}(\cdot )\) is the projection mapping to \(\mathbb {R}^n_+\).
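For concreteness, here is a Python sketch of scheme (20). The explicit Kullback–Leibler objective and gradient written below are our reading of the standard model behind (10)–(12) (the paper’s exact normalization may differ), so this is an illustrative sketch, not a quotation of the paper’s formulas.

```python
import numpy as np

def poisson_fbs(A, b, x0, sigma=1.0, theta=0.5, iters=500):
    """Sketch of scheme (20): forward step on the KL data term, then
    projection onto the nonnegative orthant, with backtracking.

    Assumed data term (standard Kullback-Leibler model):
        f(x) = sum_i ( <a_i, x> - b_i - b_i * log(<a_i, x> / b_i) ),
    whose gradient is A^T (1 - b / (A x)).
    """
    def f(x):
        y = A @ x
        if np.any(y <= 0):
            return np.inf          # outside dom f = A^{-1}(R^m_{++})
        return np.sum(y - b - b * np.log(y / b))

    def grad_f(x):
        return A.T @ (1.0 - b / (A @ x))

    x = x0.astype(float)
    for _ in range(iters):
        g = grad_f(x)
        alpha = sigma
        while True:
            x_new = np.maximum(x - alpha * g, 0.0)   # projection onto R^n_+
            d = x_new - x
            if f(x_new) <= f(x) + g @ d + (d @ d) / (2.0 * alpha):
                break
            alpha *= theta               # shrink stepsize, stay in dom f
        x = x_new
    return x
```

Returning `np.inf` outside the domain makes the line search automatically reject steps that leave \(A^{-1}(\mathbb {R}^m_{++})\), which is exactly why a line search (rather than a fixed global Lipschitz stepsize) is needed here.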
Corollary 3.1
(Q-linear convergence of method (20)) Let \((x^k)_{k\in \mathbb {N}}\) be the sequence generated by (20) with \(x^0\in A^{-1}(\mathbb {R}^m_{++})\cap \mathbb {R}^n_+\) for solving the Poisson regularized problem (10). Then, the sequences \((x^k)_{k\in \mathbb {N}}\) and \((F_1(x^k))_{k\in \mathbb {N}}\) are Q-linearly convergent to an optimal solution and the optimal value of (10), respectively.
Proof
Since both functions f and g in problem (10) satisfy our standing assumptions A1 and A2, and problem (10) always has optimal solutions, the sequence \((x^k)_{k\in \mathbb {N}}\) converges to an optimal solution \({{\bar{x}}}\) to problem (10) by Theorem 2.1. Since \(\nabla f\) is locally Lipschitz continuous around \({{\bar{x}}}\), the combination of Theorem 2.2 and Lemma 3.1 tells us that \((x^k)_{k\in \mathbb {N}}\) is Q-linearly convergent to \({{\bar{x}}}\). \(\square \)
Using a similar line of argument, one can show that the quadratic growth condition in Lemma 3.1 is also valid for the following Poisson inverse problem with sparse regularization [4]:
where \(\mu >0\) is the penalty parameter. Indeed, note that \(\Vert x\Vert _1= \langle e, x \rangle \) for \(x \in \mathbb {R}^n_+\), where \(e=(1,1,\ldots ,1)\in \mathbb {R}^n\). The objective function of (21) can be written as \(p(x)+g(x)\), where \(p(x):=f(x)+\mu \langle e,x\rangle \) and f, g are given as in (11). Then, the FBS method for solving (21) proceeds by replacing the function f in (11) by p. Let \({{\hat{x}}}\in \mathrm{dom\,}p=\mathrm{dom\,}f\) be a minimizer of (21). Observe that the function p satisfies an inequality similar to (14):
Since (14) plays the central role in the proof of Lemma 3.1, we can repeat all the steps of that proof with f replaced by p and \({{\bar{x}}}\) by \({{\hat{x}}}\) to obtain the quadratic growth condition for problem (21). This together with Corollary 3.1 shows that the FBS method (20) solves (21) with Q-linear convergence.
3.2 \(\ell _1\)-Regularized Optimization Problems
In this section, we consider the \(\ell _1\)-regularized optimization problem
$$\begin{aligned} \min _{x\in \mathbb {R}^n}\; F_2(x):=f(x)+\mu \Vert x\Vert _1, \end{aligned}$$(22)
where \(\mu >0\) and \(\Vert x\Vert _1\) denotes the \(\ell _1\)-norm of x.
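Problem (22) is proximal-friendly: the proximal mapping of \(t\Vert \cdot \Vert _1\) is the classical componentwise soft-thresholding operator, which is what the FBS step evaluates with \(t=\alpha _k\mu \). A minimal sketch:

```python
import numpy as np

def prox_l1(z, t):
    """Proximal mapping of t*||.||_1: componentwise soft-thresholding.

    prox_{t||.||_1}(z)_j = sign(z_j) * max(|z_j| - t, 0).
    """
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)
```

For example, `prox_l1(np.array([3.0, -0.5, 1.0]), 1.0)` shrinks the first entry to 2.0 and sets the other two (whose magnitudes do not exceed the threshold) to 0.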
In order to use Proposition 2.1 to characterize the strong quadratic growth condition for \(F_2\), we need the following calculation of the subgradient graphical derivative of \(\partial (\mu \Vert \cdot \Vert _1)\).
Proposition 3.1
(Subgradient graphical derivative of \(\partial (\mu \Vert \cdot \Vert _1)\)) Suppose that \({{\bar{s}}}\in \partial (\mu \Vert \cdot \Vert _1)(x^*)\). Define \(I:=\{j\in \{1,\ldots ,n\}\,|\; |{{\bar{s}}}_j|=\mu \}\), \(J:=\{j\in I \,|\;x^*_j\ne 0\}\), \(K:=\{j\in I\,|\; x^*_j=0\}\), and \(H(x^*):=\{u\in \mathbb {R}^n\,|\; u_j=0, j\notin I \text{ and } u_j{{\bar{s}}}_j\ge 0, j\in K\}\). Then, \(D\partial (\mu \Vert \cdot \Vert _1)(x^*|{{\bar{s}}})(u)\) is nonempty if and only if \(u\in H(x^*)\). Furthermore, for any \(u\in H(x^*)\) we have
$$\begin{aligned} D\partial (\mu \Vert \cdot \Vert _1)(x^*|{{\bar{s}}})(u)=\{v\in \mathbb {R}^n\,|\; v_j=0,\ j\in J \text{ and } u_jv_j=0,\ {{\bar{s}}}_jv_j\le 0,\ j\in K\}. \end{aligned}$$(23)
Proof
For any \(x\in \mathbb {R}^n\), note that
$$\begin{aligned} \partial (\mu \Vert \cdot \Vert _1)(x)=\{s\in \mathbb {R}^n\,|\; s_j=\mu \,\mathrm{sgn}(x_j) \text{ if } x_j\ne 0 \text{ and } s_j\in [-\mu ,\mu ] \text{ if } x_j=0\}, \end{aligned}$$(24)
where \(\mathrm{sgn}:\mathbb {R}\setminus \{0\}\rightarrow \{-1,1\}\) is the sign function. Take any \(v\in D\partial (\mu \Vert \cdot \Vert _1)(x^*|{{\bar{s}}})(u)\); then there exist sequences \(t^k\downarrow 0\) and \((u^k,v^k)\rightarrow (u,v)\) such that \((x^*,{{\bar{s}}})+t^k(u^k,v^k)\in \mathrm{gph}\,\partial (\mu \Vert \cdot \Vert _1)\). Let us consider the three partitions of j described below:
Partition 1.1 \(j\notin I\), i.e., \(|{{\bar{s}}}_j|<\mu \). It follows from (24) that \(x^*_j=0\). For sufficiently large k, we have \(|({{\bar{s}}}+t^kv^k)_j|<\mu \) and thus \(|(x^*+t^ku^k)_j|=0\) by (24) again. Hence, \(u^k_j=0\), which implies that \(u_j=0\) for all \(j\notin I\).
Partition 1.2 \(j\in J\), i.e., \(|{{\bar{s}}}_j|=\mu \) and \(x^*_j\ne 0\). When k is sufficiently large, we have \((x^*+t^ku^k)_j\ne 0\) and derive from (24) that
which implies that \(v_j=0\) for all \(j\in J\).
Partition 1.3 \(j\in K\), i.e., \(|{{\bar{s}}}_j|=\mu \) and \(x^*_j=0\). If there is a subsequence of \((x^*,{{\bar{s}}})_j+t^k(u^k,v^k)_j\) (without relabeling) such that \(|({{\bar{s}}}+t^kv^k)_j|<\mu =|{{\bar{s}}}_j|\), we have \({{\bar{s}}}_jv^k_j< 0\) and \((x^*+t^ku^k)_j=0\) by (24). It follows that \(u^k_j=0\). Letting \(k\rightarrow \infty \), we have \(u_j=0\) and \({{\bar{s}}}_j v_j \le 0\). Otherwise, we find some \(L>0\) such that \(|({{\bar{s}}}+t^kv^k)_j|=\mu =|{{\bar{s}}}_j|\) for all \(k>L\), which yields \(v^k_j=0\). Taking \(k\rightarrow \infty \) gives us that \(v_j=0\). Furthermore, by (24) again, we have
which implies that \({{\bar{s}}}_ju_j\ge 0\) after passing to the limit \(k\rightarrow \infty \).
Combining the conclusions in the three cases above gives us that \(u\in H(x^*)\) and also verifies the inclusion “\(\subset \)” in (23). To justify the converse inclusion “\(\supset \)”, take \(u\in H(x^*)\) and any \(v\in \mathbb {R}^n\) with \(v_j=0\) for \(j\in J\) and \(u_jv_j=0, {{\bar{s}}}_jv_j\le 0\) for \(j\in K\). For any \(t^k\downarrow 0\), we prove that \((x^*,{{\bar{s}}})+t^k(u,v)\in \mathrm{gph}\,\partial \mu \Vert \cdot \Vert _1\) and thus verify that \(v\in D\partial \mu \Vert \cdot \Vert _1(x^*|{{\bar{s}}})(u)\). Define the set-valued mapping \(\mathrm{SGN}\) componentwise by
$$\begin{aligned} \mathrm{SGN}(x)_j:={\left\{ \begin{array}{ll} \{\mathrm{sgn}(x_j)\}&{}\text{ if } x_j\ne 0,\\ {[}-1,1]&{}\text{ if } x_j=0, \end{array}\right. }\qquad j=1,\ldots ,n, \end{aligned}$$
so that \(\partial (\mu \Vert \cdot \Vert _1)(x)=\mu \,\mathrm{SGN}(x)\).
Similarly to the proof of “\(\subset \)” inclusion, we consider three partitions of j as follows:
Partition 2.1 \(j\notin I\), i.e., \(|{{\bar{s}}}_j|<\mu \). Since \(u\in H(x^*)\), we have \(u_j=0\). Note also that \(x^*_j=0\). Hence, we get \((x^*+t^ku)_j=0\) and \(({{\bar{s}}}+t^kv)_j\in [-\mu ,\mu ]\) when k is sufficiently large, which means \(({{\bar{s}}}+t^kv)_j\in \mu \, \mathrm{SGN}(x^*+t^ku)_j\).
Partition 2.2 \(j\in J\), i.e., \(|{{\bar{s}}}_j|=\mu \) and \(x^*_j\ne 0\). Since \(v_j=0\), we have
and \((x^*+t^ku)_j\ne 0\) when k is large. It follows that \(({{\bar{s}}}+t^kv)_j\in \mu \, \mathrm{SGN}(x^*+t^ku)_j\).
Partition 2.3 \(j\in K\), i.e., \(|{{\bar{s}}}_j|=\mu \) and \(x^*_j=0\). If \(u_j=0\), we have \((x^*+t^ku)_j=0\) and \(|({{\bar{s}}}+t^kv)_j|\le |{{\bar{s}}}_j|\le \mu \) for sufficiently large k, since \({{\bar{s}}}_jv_j\le 0\). If \(u_j\ne 0\), we have \(v_j=0\) and
when k is large, since \(u_j{{\bar{s}}}_j\ge 0\). In both cases, we have \(({{\bar{s}}}+t^kv)_j\in \mu \, \mathrm{SGN}(x^*+t^ku)_j\).
From those cases, we always have \((x^*,{{\bar{s}}})+t^k(u,v)\in \mathrm{gph}\,\partial \mu \Vert \cdot \Vert _1\) and thus \(v\in D\partial \mu \Vert \cdot \Vert _1(x^*|{{\bar{s}}})(u)\). \(\square \)
As a consequence, we establish a characterization of the strong quadratic growth condition for \(F_2\).
Theorem 3.1
(Characterization of strong quadratic growth condition for \(F_2\)) Let \(x^*\) be an optimal solution to problem (22). Suppose that \(\nabla f\) is differentiable at \(x^*\). Define \(\mathcal {E}:=\big \{j\in \{1,\ldots ,n\}\,|\; |(\nabla f(x^*))_j|=\mu \big \}\), \(K:=\{j\in \mathcal {E}\,|\; x^*_j=0\}\), \(\mathcal {U}:=\{u\in \mathbb {R}^\mathcal {E}\,|\; u_j(\nabla f(x^*))_j\le 0, j\in K\}\) with \(\mathbb {R}^\mathcal {E}=\{u=(u_j)_{j\in \mathcal {E}}\in \mathbb {R}^\mathcal {|E|}\}\), and \(\mathcal {H}_\mathcal {E}(x^*):=[\nabla ^2 f(x^*)_{i,j}]_{i,j\in \mathcal {E}}\). Then, the following statements are equivalent:
- (i) \(F_2\) satisfies the strong quadratic growth condition at \(x^*\).
- (ii) \(\mathcal {H}_\mathcal {E}(x^*)\) is positive definite over \(\mathcal {U}\) in the sense that
$$\begin{aligned} \langle \mathcal {H}_\mathcal {E}(x^*) u,u\rangle >0\qquad \text{ for } \text{ all }\quad u\in \mathcal {U}\setminus \{0\}. \end{aligned}$$(25)
- (iii) \(\mathcal {H}_\mathcal {E}(x^*)\) is nonsingular over \(\mathcal {U}\) in the sense that
$$\begin{aligned} \ker \mathcal {H}_\mathcal {E}(x^*)\cap \mathcal {U}=\{0\}. \end{aligned}$$(26)
Moreover, if (25) is satisfied, then \( F_2\) satisfies the strong quadratic growth condition with any positive modulus \(\kappa <\ell \), where
$$\begin{aligned} \ell :=\inf \left\{ \frac{\langle \mathcal {H}_\mathcal {E}(x^*)u,u\rangle }{\Vert u\Vert ^2}\,\Big |\; u\in \mathcal {U}\right\} \end{aligned}$$(27)
with the convention \(\frac{0}{0}=\infty \).
Proof
First, let us verify the equivalence between (i) and (ii) by using Proposition 2.1. Indeed, for any \(v\in D(\partial F_2)(x^*|0)(u)\), we get from the sum rule [20, Proposition 4A.2] that
Define \(\mathcal {V}:=\{u\in \mathbb {R}^n|\; u_j=0, j\notin \mathcal {E},u_j(\nabla f(x^*))_j\le 0, j\in K\}\). Thanks to Proposition 3.1, we have
This tells us that (25) coincides with (3) when \(h=F_2\). By Proposition 2.1, (i) and (ii) are equivalent. Moreover, \(F_2\) satisfies the strong quadratic growth condition with any positive modulus \(\kappa <\ell \).
Finally, the equivalence between (ii) and (iii) is trivial due to the fact that f is convex and thus \(\mathcal {H}_\mathcal {E}(x^*)\) is positive semi-definite. \(\square \)
Corollary 3.2
(Linear convergence of FBS method for \(\ell _1\)-regularized problems) Let \((x^k)_{k\in \mathbb {N}}\) be the sequence generated by the FBS method for problem (22). Suppose that the solution set \(S^*\) is nonempty, that \((x^k)_{k\in \mathbb {N}}\) converges to some \(x^*\in S^*\), and that f is \(\mathcal {C}^2\) around \(x^*\). If condition (25) holds, then \((x^k)_{k\in \mathbb {N}}\) and \((F_2(x^k))_{k\in \mathbb {N}}\) are Q-linearly convergent to \(x^*\) and \(F_2(x^*)\), respectively, with rates determined in Corollary 2.1, where \(\kappa \) is any positive number smaller than \(\ell \) in (27).
Proof
Since f is \(\mathcal {C}^2\) around \(x^*\), \(\nabla f\) is locally Lipschitz continuous around \(x^*\). The result follows from Corollary 2.1 and Theorem 3.1. \(\square \)
Remark 3.1
It is worth noting that condition (26) is strictly weaker than the assumption used in [28] that \(\mathcal {H}_{\mathcal {E}}(x^*)\) has full rank, which was imposed there to obtain the linear convergence of FBS for (22). Indeed, consider the case \(n=2\), \(\mu =1\), and \(f(x_1,x_2)=\frac{1}{2}(x_1+x_2)^2+x_1+x_2\). Note that \(x^*=(0,0)\) is an optimal solution to problem (22). Moreover, direct computation gives us that \(\nabla f(x^*)=(1,1)\), \(\mathcal {E}=\{1,2\}\), and \(\mathcal {H}_\mathcal {E}(x^*)=\begin{pmatrix} 1 &{} 1\\ 1 &{}1\end{pmatrix}\). It is clear that \(\mathcal {H}_\mathcal {E}(x^*)\) does not have full rank, but condition (25) and its equivalent form (26) hold.
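The example in this remark is easy to verify numerically. The snippet below checks that \(\mathcal {H}_\mathcal {E}(x^*)\) is rank deficient while its kernel meets the cone \(\mathcal {U}=\{u\,|\,u_1\le 0,\ u_2\le 0\}\) only at the origin, so condition (26) holds without full rank:

```python
import numpy as np

# H from Remark 3.1: singular, yet ker H ∩ U = {0} for
# U = {u : u_1 <= 0, u_2 <= 0} (since grad f(x*) = (1, 1)).
H = np.array([[1.0, 1.0], [1.0, 1.0]])

# H does not have full rank ...
assert np.linalg.matrix_rank(H) == 1

# ... and its kernel is spanned by (1, -1):
kernel_dir = np.array([1.0, -1.0])
assert np.allclose(H @ kernel_dir, 0.0)

# Every nonzero kernel vector t*(1, -1) has one strictly positive
# component, so it cannot lie in the nonpositive orthant U;
# hence ker H ∩ U = {0}, i.e., condition (26) holds.
for t in [2.0, 0.5, -1.0, -3.0]:
    u = t * kernel_dir
    assert max(u[0], u[1]) > 0
```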
3.3 Global Q-linear Convergence of ISTA on Lasso Problem
In this section, we study the linear convergence of ISTA for the Lasso problem
$$\begin{aligned} \min _{x\in \mathbb {R}^n}\; F_3(x):=\frac{1}{2}\Vert Ax-b\Vert ^2+\mu \Vert x\Vert _1, \end{aligned}$$(29)
where A is an \(m\times n\) real matrix, \(b\in \mathbb {R}^m\), and \(\mu >0\).
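For reference, ISTA is FBS applied to (29): a gradient step on the quadratic data term followed by soft-thresholding. A minimal constant-stepsize sketch (the choice \(1/\lambda _{\mathrm{max}}(A^TA)\) corresponds to the constant-step setting discussed in Remark 3.2; this is an illustration, not the exact scheme of [8]):

```python
import numpy as np

def ista(A, b, mu, x0, iters=1000):
    """ISTA for the Lasso problem (29) with constant stepsize.

    Each iteration takes a gradient step on (1/2)||Ax - b||^2 and then
    applies the soft-thresholding prox of alpha*mu*||.||_1.
    """
    # lambda_max(A^T A) is the global Lipschitz constant of the gradient
    L = np.linalg.eigvalsh(A.T @ A).max()
    alpha = 1.0 / L
    x = x0.astype(float)
    for _ in range(iters):
        grad = A.T @ (A @ x - b)
        z = x - alpha * grad
        x = np.sign(z) * np.maximum(np.abs(z) - alpha * mu, 0.0)
    return x
```

On the scalar instance \(A=[1]\), \(b=1\), \(\mu =0.5\), the iteration reaches the known minimizer \(x^*=0.5\) of \(\frac{1}{2}(x-1)^2+0.5|x|\).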
The following lemma taken from [10, Lemma 10] plays an important role in our proof.
Lemma 3.2
(Global error bound) Fix any \(R>\frac{\Vert b\Vert ^2}{2\mu }\). Suppose that \(x^*\) is an optimal solution to problem (29). Then, we have
where
while \(\nu \) is the Hoffman constant defined in [10, Definition 1] and only depends on the initial data \(A, b, \mu \).
The global R-linear convergence of the sequence \((x^k)_{k\in \mathbb {N}}\) generated by ISTA and the Q-linear convergence of \((F_3(x^k))_{k\in \mathbb {N}}\) for the Lasso problem were obtained in [25, Theorem 4.2 and Remark 4.3] and also in [26, Theorem 4.8]. Here, we add another feature: the iterative sequence \((x^k)_{k\in \mathbb {N}}\) is also globally Q-linearly convergent.
Theorem 3.2
(Global Q-linear convergence of ISTA) Let \((x^k)_{k\in \mathbb {N}}\) be the sequence generated by ISTA for problem (29) that converges to an optimal solution \(x^*\in S^*\). Then, \((x^k)_{k\in \mathbb {N}}\) and \((F_3(x^k))_{k\in \mathbb {N}}\) are globally Q-linearly convergent to \(x^*\) and \(F_3(x^*)\), respectively:
for all \(k\in \mathbb {N}\), where R is any number bigger than \(\Vert x^0\Vert +\frac{\Vert b\Vert ^2}{\mu }\) and \(\gamma _R\) is given as in (30) while \(\alpha :=\min \left\{ \sigma ,\frac{\theta }{\lambda _\mathrm{max}(A^TA)}\right\} \).
Proof
Note that Lasso always has optimal solutions. With \(x^*\in S^*\), we have
which implies that \(\Vert x^*\Vert \le \Vert x^*\Vert _1\le \frac{1}{2\mu }\Vert b\Vert ^2\). It follows from [8, Corollary 3.1] that
for all \(k\in \mathbb {N}\). Thanks to Lemma 3.2, [8, Corollary 3.1], and [8, Proposition 3.2], we have
with \(\alpha =\min \left\{ \sigma ,\frac{\theta }{\lambda _\mathrm{max}(A^TA)}\right\} \), noting that \(\lambda _{\mathrm{max}}(A^TA)\) is a global Lipschitz constant of the gradient of \(\frac{1}{2}\Vert Ax-b\Vert ^2\). The proof of (31) and (32) is quite similar to that of (8) and (9) in Theorem 2.2; see [8, Theorem 4.1] for further details. \(\square \)
Remark 3.2
(Linear convergence rate comparisons for ISTA) In this remark, we provide some comparisons for the derived linear convergence rate for ISTA with the existing results in the literature.
- For the sequence \((x^k)_{k\in \mathbb {N}}\) generated by ISTA, we first note that our derived Q-linear convergence rate in Theorem 3.2 is \(\dfrac{1}{\sqrt{1+\frac{\gamma _R}{4 \lambda _{\mathrm{max}}(A^TA)}}}\) according to (31). This result is new to the literature. In [10, Theorem 25 and Remark 26], R-linear convergence for this sequence via \(\gamma _R\) was obtained. In the case of constant step size, by setting \(\sigma =\dfrac{1}{\lambda _{\mathrm{max}}(A^TA)}\) and \(\theta =1\), we have \(\alpha _k=\alpha =\sigma \); see [8, Remark 4.1]. In this case, the corresponding R-linear convergence rate given in [10] reads as \(\dfrac{1}{\sqrt{1+\frac{\gamma _R}{3\lambda _{\mathrm{max}}(A^TA)}}}\). On the other hand, using [8, Proposition 4.1(i)] and Lemma 3.2, one can deduce that the R-linear rate for \((x^k)_{k\in \mathbb {N}}\) is \(\dfrac{1}{\sqrt{1+\frac{\gamma _R}{\lambda _{\mathrm{max}}(A^TA)}}}\), which is sharper than the rate \(\dfrac{1}{\sqrt{1+\frac{\gamma _R}{3\lambda _{\mathrm{max}}(A^TA)}}}\) given in [10].
- For the R-linear convergence rate for \((F_3(x^k))_{k\in \mathbb {N}}\), from [8, Proposition 4.1(ii)] and Lemma 3.2, one can deduce that the rate is \(\dfrac{1}{1+\frac{\gamma _R}{\lambda _{\mathrm{max}}(A^TA)}}\). This rate is sharper than the one \(\dfrac{1}{1+\frac{\gamma _R}{3\lambda _\mathrm{max}(A^TA)}}\) derived in [10, Remark 26]. However, the Q-linear rate \(\dfrac{1}{1+\frac{\gamma _R}{4 \lambda _\mathrm{max}(A^TA)}}\) for \((F_3(x^k))_{k\in \mathbb {N}}\) obtained by combining [25, Theorem 4.2(iii)] and Lemma 3.2 is better than our rate given in (32); see also [8, Remark 4.1] for related comparisons. How to improve the Q-linear convergence rate for \((F_3(x^k))_{k\in \mathbb {N}}\) is an interesting future direction of research.
Observe further that the linear rates in Theorem 3.2 depend on the initial point \(x^0\); see also [26, Theorem 4.8]. Next, we show that the local linear rates around optimal solutions are uniform and independent of the choice of \(x^0\).
Corollary 3.3
(Local Q-linear convergence of ISTA with uniform rate) Let \((x^k)_{k\in \mathbb {N}}\) be the sequence generated by ISTA for problem (29) that converges to an optimal solution \(x^*\in S^*\). Then, (31) and (32) are satisfied when k is sufficiently large, where \(\alpha = \min \left\{ \sigma ,\frac{\theta }{\lambda _{\mathrm{max}}(A^TA)}\right\} \) and R is any number bigger than \(\frac{\Vert b\Vert ^2}{2\mu }\).
Proof
Note from the proof of Theorem 3.2 that \(\Vert x^*\Vert \le \frac{\Vert b\Vert ^2}{2\mu }<R\). By Lemma 3.2, there exists some \(\varepsilon \in (0,R-\Vert x^*\Vert )\) such that the quadratic growth condition holds at \(x^*\):
The corollary follows directly from the second part of Theorem 2.2. \(\square \)
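The Q-linear behavior established above can also be observed numerically by tracking the quotients \(\Vert x^{k+1}-x^*\Vert /\Vert x^k-x^*\Vert\). The sketch below uses a long high-accuracy ISTA run as a surrogate for \(x^*\); the instance, step-size choice, and iteration counts are illustrative, not from the paper:

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 10))   # illustrative instance
b = rng.standard_normal(30)
mu = 0.5
t = 1.0 / np.linalg.eigvalsh(A.T @ A).max()

def ista_iterates(x0, iters):
    """Return the list of ISTA iterates starting from x0."""
    x, out = x0.copy(), []
    for _ in range(iters):
        x = soft_threshold(x - t * (A.T @ (A @ x - b)), mu * t)
        out.append(x.copy())
    return out

# High-accuracy run as a numerical surrogate for the optimal solution x^*.
x_star = ista_iterates(np.zeros(10), 20000)[-1]

# Q-linear quotients ||x^{k+1} - x^*|| / ||x^k - x^*|| along early iterates.
d = [np.linalg.norm(x - x_star) for x in ista_iterates(np.zeros(10), 60)]
q = [d[k + 1] / d[k] for k in range(len(d) - 1) if d[k] > 1e-12]
print(max(q) < 1.0)  # quotients stay below 1: Q-linear behavior
```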
3.4 Discussions on Nuclear Norm Regularized Least Square Optimization Problems
Another important optimization problem, which has received a lot of attention, is the so-called nuclear norm regularized least square optimization problem
Here, \(\mathcal {A}:\mathbb {R}^{p\times q}\rightarrow \mathbb {R}^{m\times n}\) is a linear operator, \(B \in \mathbb {R}^{m\times n}\), and \(\Vert X\Vert _*\) is the nuclear norm of X, defined as the sum of its singular values.
Similar to the development in [8], Q-linear convergence can be derived by assuming (strong) quadratic growth conditions. On the other hand, the following example shows that, different from Lasso problem (29) studied in Sect. 3.3, the (strong) quadratic growth condition is no longer automatically true for the nuclear norm regularized least square optimization problem (34), even when the underlying problem admits a unique solution.
Example 3.1
(Failure of quadratic growth condition for nuclear norm regularized optimization problems) Consider the following optimization problem:
which is a particular case of (34) with \(\mathcal {A}(X)=\left[ \begin{array}{c} X_{11}+X_{22} \\ X_{12}-X_{21}+X_{22} \end{array}\right] \) for any \(X= \left[ \begin{array}{cc} X_{11} &{} X_{12} \\ X_{21} &{} X_{22} \end{array}\right] \in \mathbb {R}^{2\times 2}\), \(B=\left[ \begin{array}{c} 2 \\ 0 \end{array}\right] \), and \(\mu =1\). For \(X:=\left[ \begin{array}{cc} a &{}b \\ c &{}d \end{array}\right] \), letting \(\sigma _1\) and \(\sigma _2\) be the singular values of X, we have
Given \(X= \left[ \begin{array}{cc} a &{} b \\ c &{} d \end{array}\right] \), it follows that
Moreover, \(h(X)=\frac{3}{2}\) if and only if \(a+d=1\), \(ad-bc\ge 0\), \(b-c=0\), and \(b-c+d=0\), which forces \(b=c=d=0\) and \(a=1\). Thus, \(\overline{X}=\left[ \begin{array}{cc} 1 &{}0 \\ 0 &{}0 \end{array}\right] \) is the unique optimal solution to problem (35). Choose \(X_\varepsilon =\left[ \begin{array}{cc} 1-\varepsilon ^{1.5}&{}\varepsilon -\varepsilon ^{1.5} \\ \varepsilon &{}\varepsilon ^{1.5} \end{array}\right] \) with \(\varepsilon >0\) sufficiently small and note that
Observe further that \(\Vert X_\varepsilon -\bar{X}\Vert _F^2=\varepsilon ^3+(\varepsilon -\varepsilon ^{1.5})^2+\varepsilon ^2+\varepsilon ^3=\mathcal {O}(\varepsilon ^2)\). This tells us that \({{\overline{X}}}\) does not satisfy the strong quadratic growth condition for (35). Since \({\overline{X}}\) is the unique solution, we also see that the quadratic growth condition (2) fails at \({{\overline{X}}}\).
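The failure of quadratic growth can be checked numerically: along the curve \(X_\varepsilon\) above, the ratio \((h(X_\varepsilon )-h(\overline{X}))/\Vert X_\varepsilon -\overline{X}\Vert _F^2\) tends to 0, so no growth constant \(c>0\) can exist. A minimal sketch (the \(\varepsilon\) values are illustrative choices):

```python
import numpy as np

def h(X):
    """Objective of problem (35): 0.5*||A(X) - B||^2 + ||X||_* with
    A(X) = (X11 + X22, X12 - X21 + X22) and B = (2, 0)."""
    AX = np.array([X[0, 0] + X[1, 1], X[0, 1] - X[1, 0] + X[1, 1]])
    return 0.5 * np.sum((AX - np.array([2.0, 0.0])) ** 2) + np.linalg.norm(X, 'nuc')

X_bar = np.array([[1.0, 0.0], [0.0, 0.0]])
ratios = []
for eps in [1e-1, 1e-2, 1e-3]:
    X_eps = np.array([[1 - eps**1.5, eps - eps**1.5],
                      [eps, eps**1.5]])
    gap = h(X_eps) - h(X_bar)                            # behaves like eps^3 / 2
    dist2 = np.linalg.norm(X_eps - X_bar, 'fro') ** 2    # behaves like 2 * eps^2
    ratios.append(gap / dist2)

print(ratios)  # decreasing toward 0, so no c > 0 with gap >= c * dist2
```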
Remark 3.3
Moreover, by setting \(X^0\) as the identity matrix, \(\sigma =1\), and \(\theta =\frac{1}{2}\), we solve problem (35) numerically by FBS (6) with Beck–Teboulle's line search and store the quotients \(\delta _k:=\frac{h(X^{k+1})-h(\overline{X})}{h(X^{k})-h({{\overline{X}}})}\) and \(\eta _k:=\frac{\Vert X^{k+1}-{{\overline{X}}}\Vert _F}{\Vert X^{k}-{{\overline{X}}}\Vert _F}\). After 276 iterations, both \(\delta _k\) and \(\eta _k\) are close to 1 with error \(10^{-14}\). This suggests that Q-linear convergence is unlikely to occur for either of the sequences \(\{h(X^{k})-h(\overline{X})\}\) and \(\{\Vert X^{k}-{{\overline{X}}}\Vert _F\}\).
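A variant of this experiment can be sketched with a constant step size instead of the line search (our simplification): the proximal mapping of the nuclear norm is singular value soft-thresholding, and the Lipschitz constant of \(\nabla f\) is \(\lambda_{\mathrm{max}}(\mathbb {A}\mathbb {A}^T)=(5+\sqrt{5})/2\) for the \(2\times 4\) matrix \(\mathbb {A}\) representing \(\mathcal {A}\) (a computed value, not stated in the paper). The quotient \(\eta_k\) again drifts toward 1:

```python
import numpy as np

def prox_nuc(X, tau):
    """Prox of tau*||.||_* : soft-thresholding of the singular values."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def grad_f(X):
    # f(X) = 0.5*||A(X) - B||^2 with A(X) = (X11 + X22, X12 - X21 + X22), B = (2, 0);
    # the gradient is the adjoint A^* applied to the residual A(X) - B.
    r1 = X[0, 0] + X[1, 1] - 2.0
    r2 = X[0, 1] - X[1, 0] + X[1, 1]
    return np.array([[r1, r2], [-r2, r1 + r2]])

t = 2.0 / (5.0 + np.sqrt(5.0))                   # step 1/L with L = (5+sqrt(5))/2
X = np.eye(2)                                    # X^0 = identity, as in the remark
X_bar = np.array([[1.0, 0.0], [0.0, 0.0]])       # the unique optimal solution

etas = []
for _ in range(500):
    X_new = prox_nuc(X - t * grad_f(X), t)       # FBS step with mu = 1
    etas.append(np.linalg.norm(X_new - X_bar) / np.linalg.norm(X - X_bar))
    X = X_new

print(etas[-1])  # close to 1: no Q-linear contraction of ||X^k - X_bar||_F
```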
The quadratic growth condition for the nuclear norm regularized problem was studied in [53] under the nondegeneracy condition (Footnote 2) \(0\in \mathrm{ri}\, \partial h({{\overline{X}}})\), where \(\mathrm{ri}\, \partial h({{\overline{X}}})\) is the relative interior of \(\partial h({{\overline{X}}})\). Although the nondegeneracy condition is an important property in matrix optimization, it can be restrictive for some applications. Without assuming the nondegeneracy condition for (34), the strong quadratic growth condition can be used to guarantee the linear convergence of FBS as in Corollary 2.1 and Sect. 3.2. The strong quadratic growth condition for problem (34) can be characterized via second-order analysis of the nuclear norm [17, 52]. On the other hand, the corresponding characterizations are highly non-trivial and take a rather complicated form that may not be easily verifiable in general. Obtaining easily verifiable and computationally tractable conditions ensuring the (strong) quadratic growth condition for nuclear norm regularized optimization problems, or more generally for matrix optimization problems, deserves a separate study and is beyond the scope of the current paper.
4 Uniqueness of Optimal Solution to \(\ell _1\)-Regularized Least Square Optimization Problems
As discussed in Sect. 1, the linear convergence of ISTA for Lasso was sometimes obtained by imposing an additional assumption that Lasso has a unique optimal solution \(x^*\); see, e.g., [41]. Since \(F_3\) satisfies the quadratic growth condition at \(x^*\) (Lemma 3.2), the uniqueness of \(x^*\) is equivalent to the strong quadratic growth condition of \(F_3\) at \(x^*\). This observation together with Theorem 3.1 allows us to characterize the uniqueness of optimal solution to Lasso in the next result. A different characterization of this property can be found in [51, Theorem 2.1]. Suppose that \(x^*\) is an optimal solution, which means \(-A^T(Ax^*-b)\in \partial (\mu \Vert \cdot \Vert _1)(x^*)\). In the spirit of Proposition 3.1 with \(f(x)=\frac{1}{2}\Vert Ax-b\Vert ^2\), define
Since \(-A^T(Ax^*-b)\in \partial (\mu \Vert \cdot \Vert _1)(x^*)\), if \(x_j^*\ne 0\), then \((A^T(Ax^*-b))_j=-\mu \,\mathrm{sign} (x^*_j)\). This tells us that \(J=\{j\in \{1,\ldots ,n\}|\; x^*_j\ne 0\}=:\mathrm{supp}\, (x^*)\). Furthermore, given an index set \(I\subset \{1,\ldots , n\}\), we denote by \(A_I\) the submatrix of A formed by the columns \(A_i\), \(i\in I\), and by \(x_I\) the subvector of \(x\in \mathbb {R}^n\) formed by \(x_i\), \(i\in I\). For any \(x\in \mathbb {R}^n\), we also define \(\mathrm{sign}\,(x):= (\mathrm{sign}\,(x_1), \ldots , \mathrm{sign}\,(x_n))^T\) and denote by \(\mathrm{Diag}\,(x)\) the square diagonal matrix with diagonal entries \(x_1, x_2, \ldots , x_n\).
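The index sets and the matrix \(Q_K\) used below can be computed directly from the data. The sketch assumes \(\mathcal {E}=\{j:\ |(A^T(Ax^*-b))_j|=\mu \}\), consistent with the role of \(\mathcal {E}\) in Proposition 3.1 (not reproduced here), and uses that \(A_Jx^*_J=Ax^*\) since \(x^*\) is supported on J:

```python
import numpy as np

def index_sets(A, b, x_star, mu, tol=1e-8):
    """Index sets around (37): E = {j : |(A^T(Ax* - b))_j| = mu} (our reading),
    J = supp(x*), K = E \\ J, and Q_K = Diag[sign(A_K^T(A_J x*_J - b))]."""
    g = A.T @ (A @ x_star - b)                      # gradient of the smooth part
    E = np.where(np.abs(np.abs(g) - mu) <= tol)[0]
    J = np.where(np.abs(x_star) > tol)[0]
    K = np.array([j for j in E if j not in set(J)], dtype=int)
    # A_J x*_J = A x* since x* is supported on J, so sign(g[K]) gives Q_K.
    Q_K = np.diag(np.sign(g[K])) if K.size else np.zeros((0, 0))
    return E, J, K, Q_K

# Tiny instance with A = I, where the Lasso solution is soft-thresholding of b.
A = np.eye(3)
b = np.array([2.0, 1.0, 0.5])
mu = 1.0
x_star = np.sign(b) * np.maximum(np.abs(b) - mu, 0.0)   # x* = (1, 0, 0)
E, J, K, Q_K = index_sets(A, b, x_star, mu)
print(E, J, K, Q_K)  # E = [0 1], J = [0], K = [1], Q_K = [[-1.]]
```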
Theorem 4.1
(Uniqueness of optimal solution to Lasso problem) Let \(x^*\) be an optimal solution to problem (29). The following statements are equivalent:
- (i) \(x^*\) is the unique optimal solution to Lasso (29).
- (ii) The system \(A_Jx_J-A_KQ_Kx_K=0\) and \(x_K\in \mathbb {R}^K_+\) has a unique solution \((x_J,x_K)=(0_J,0_K)\in \mathbb {R}^J\times \mathbb {R}^K\), where \(Q_K:=\mathrm{Diag}\,\big [\mathrm{sign}\,(A_K^T(A_Jx^*_J-b))\big ]\).
- (iii) The submatrix \(A_J\) has full column rank and the columns of \(A_JA_J^\dag A_KQ_K-A_KQ_K\) are positively linearly independent in the sense that
$$\begin{aligned} \mathrm{Ker}\, (A_JA_J^\dag A_KQ_K-A_KQ_K)\cap \mathbb {R}^K_+= \{0_K\}, \end{aligned}$$ (38)
where \(A_J^\dag :=(A_J^TA_J)^{-1}A_J^T\) is the Moore–Penrose pseudoinverse of \(A_J\).
- (iv) The submatrix \(A_J\) has full column rank and there exists a Slater point \(y\in \mathbb {R}^m\) such that
$$\begin{aligned} (Q_KA_K^T A_JA_J^\dag -Q_KA_K^T)y<0. \end{aligned}$$ (39)
Proof
Since \(F_3\) satisfies the quadratic growth condition at \(x^*\) as in Lemma 3.2, (i) means that \(F_3\) satisfies the strong quadratic growth condition at \(x^*\). Thus, by Theorem 3.1, (i) is equivalent to
with \(f(x)=\frac{1}{2}\Vert Ax-b\Vert ^2\) and \( {\mathcal {U}}=\{u\in \mathbb {R}^{\mathcal {E}}|\; u_j(\nabla f(x^*))_j\le 0, j\in K\}\). Note that \(\mathcal {H}_{{\mathcal {E}}}=[\nabla ^2 f(x^*)_{i,j}]_{i,j\in \mathcal {E}}=[(A^TA)_{i,j}]_{i,j\in \mathcal {E}}=A_\mathcal {E}^TA_\mathcal {E}\). Hence, (40) means the system
has a unique solution \(u=(u_J, u_K)=(0_J, 0_K)\in \mathbb {R}^J\times \mathbb {R}^K\), where \( {\mathcal {U}}_K\) is defined by
As observed after (37), \(J=\mathrm{supp}\, (x^*)\); for each \(k\in K\) we have
It follows that \( {\mathcal {U}}_K=-Q_K(\mathbb {R}^K_+)\) and that \(Q_K\) is a nonsingular diagonal matrix (each diagonal entry is either 1 or \(-1\)). Uniqueness of the solution to system (41) is thus equivalent to (ii). This verifies the equivalence between (i) and (ii).
Let us justify the equivalence between (ii) and (iii). To proceed, suppose that (ii) is valid, i.e., the system
has a unique solution \((0_J,0_K)\in \mathbb {R}^J\times \mathbb {R}^K\). Choosing \(x_K=0_K\), the latter tells us that the equation \(A_Jx_J=0\) has a unique solution \(x_J=0\), i.e., \(A_J\) has full column rank. Thus, \(A_J^TA_J\) is nonsingular. Furthermore, it follows from (42) that \(A_J^TA_Jx_J=A_J^TA_KQ_Kx_K\), which means
This together with (42) tells us that the system
has a unique solution \(x_K=0_K\in \mathbb {R}^K\), which clearly verifies (38) and thus (iii).
To justify the converse implication, suppose that (iii) is valid. Consider equation (42) in (ii); since \(A_J\) has full column rank, we again obtain (43). Similar to the above justification, one sees that \(x_K\) satisfies equation (44). Thanks to (38) in (iii), we get from (44) that \(x_K=0_K\) and thus \(x_J=0_J\) by (43). This verifies that equation (42) in (ii) has a unique solution \((x_J,x_K)=(0_J,0_K)\).
Finally, the equivalence between (iii) and (iv) follows from the well-known Gordan’s lemma [11, Theorem 2.2.1] and the fact that the matrix \(A_JA_J^\dag \) is symmetric.
\(\square \)
Next, let us discuss some known conditions related to the uniqueness of optimal solution to Lasso. In [23], Fuchs introduced a sufficient condition for this property:
The first equality (45) indeed tells us that \(x^*\) is an optimal solution to Lasso problem. Inequality (46) means that \(\mathcal {E}=J\), i.e., \(K=\emptyset \) in Theorem 4.1. Condition (47) is also present in our characterizations. Hence, Fuchs' condition implies (iii) in Theorem 4.1, but it is clearly not a necessary condition for the uniqueness of optimal solution to Lasso problem, since in many situations the set K is not empty.
Furthermore, in the recent work [43], Tibshirani shows that the optimal solution \(x^*\) to problem (29) is unique when the matrix \(A_\mathcal {E}\) has full column rank. This condition implies our condition (ii) in Theorem 4.1. Indeed, if \((x_J,x_K)\) satisfies system (42) in (ii), we have \(A_\mathcal {E}\begin{pmatrix}x_J\\ -Q_Kx_K\end{pmatrix}=0\), which implies that \(x_J=0\) and \(Q_Kx_K=0\) when \(\ker A_\mathcal {E}=\{0\}\). Since \(Q_K\) is invertible, the latter tells us that \(x_J=0\) and \(x_K=0\), which clearly verifies (ii). Tibshirani's condition is also necessary for the uniqueness of optimal solution to Lasso problem for almost all b in (29), but not for all b; a concrete example can be found in [51].
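Tibshirani's sufficient condition is straightforward to test numerically. A minimal sketch, again assuming \(\mathcal {E}=\{j:\ |(A^T(Ax^*-b))_j|=\mu \}\) (our reading of \(\mathcal {E}\), which is defined earlier in the paper):

```python
import numpy as np

def tibshirani_sufficient(A, b, x_star, mu, tol=1e-8):
    """Sufficient condition from [43]: the submatrix A_E has full column rank,
    where E = {j : |(A^T(A x* - b))_j| = mu}."""
    g = A.T @ (A @ x_star - b)
    E = np.where(np.abs(np.abs(g) - mu) <= tol)[0]
    A_E = A[:, E]
    return A_E.shape[1] == 0 or int(np.linalg.matrix_rank(A_E)) == A_E.shape[1]

# Tiny instance with A = I, where the Lasso solution is soft-thresholding of b
# and the columns of A_E are trivially independent.
A = np.eye(3)
b = np.array([2.0, 1.0, 0.5])
mu = 1.0
x_star = np.sign(b) * np.maximum(np.abs(b) - mu, 0.0)
ok = tibshirani_sufficient(A, b, x_star, mu)
print(ok)  # True
```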
In the recent works [50, 51], the following useful characterization of unique solution to Lasso has been established under mild assumptions:
It remains open to connect this condition directly to those in Theorem 4.1, although they must be logically equivalent under the assumptions required in [50, 51]. However, our approach via second-order variational analysis is completely different and provides several new characterizations of the uniqueness of optimal solution to Lasso. It is also worth mentioning that the standing assumption in [51] that A has full row rank is relaxed in our study.
5 Conclusion
In this paper, we analyze quadratic growth conditions for some structured optimization problems using second-order variational analysis. This allows us to establish the Q-linear convergence of FBS for
- Poisson regularized optimization problems and Lasso problems, with no assumption on the initial data;
- \(\ell _1\)-regularized optimization problems, with mild assumptions via second-order conditions.
As a by-product, we also obtain full characterizations for the uniqueness of optimal solution to Lasso problem, which complements and extends recent important results in the literature.
Our results in this paper point to several interesting research questions, particularly concerning the extension of our approach to matrix optimization problems such as nuclear norm regularized optimization problems.
- Firstly, as we have seen in Example 3.1, for the nuclear norm regularized optimization problem (34), the (strong) quadratic growth condition can fail even for problems with unique solutions. Thus, there is a gap between the uniqueness of the solution and the strong quadratic growth condition for (34). How to characterize this gap for the nuclear norm regularized optimization problem or, more generally, for matrix optimization problems would be an important research topic to investigate.
In particular, solution uniqueness for problem (34) has been characterized via the so-called descent cone [13]. Evaluating the descent cone for the nuclear norm will help us better understand solution uniqueness for (34) and the gap between uniqueness of the solution and the strong quadratic growth condition for (34).
- Secondly, what is the tightest possible complexity of FBS for solving the nuclear norm minimization problem? Certainly, the complexity is at least \(o(\frac{1}{k})\) as studied in [7,8,9, 40]. But FBS may fail to exhibit linear convergence when the quadratic growth condition fails, as discussed in Remark 3.3. Due to the algebraic structure of the nuclear norm, it is natural to conjecture that the complexity is \(\mathcal {O}(\frac{1}{k^\beta })\) for some \(\beta >1\). Finding the optimal \(\beta \) is another research direction that deserves further study.
Notes
In [8], we examined FBS in the more general (possibly infinite dimensional) Hilbert space setting. On the other hand, for the purpose of discussing the structured optimization problems later on, we restrict ourselves to the finite-dimensional setting here.
References
Aragón Artacho, F.J., Geoffroy, M.H.: Characterizations of metric regularity of subdifferentials. J. Convex Anal. 15, 365–380 (2008)
Aragón Artacho, F.J., Geoffroy, M.H.: Metric subregularity of the convex subdifferential in Banach spaces. J. Nonlinear Convex Anal. 15, 35–47 (2014)
Azé, D., Corvellec, J.-N.: Nonlinear local error bounds via a change of metric. J. Fixed Point Theory Appl. 16, 251–372 (2014)
Bauschke, H.H., Bolte, J., Teboulle, M.: A descent lemma beyond Lipschitz gradient continuity: first-order methods revisited and applications. Math. Oper. Res. 42, 330–348 (2017)
Bauschke, H.H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, New York (2011)
Bauschke, H.H., Noll, D., Phan, H.M.: Linear and strong convergence of algorithms involving averaged nonexpansive operators. J. Math. Anal. Appl. 421, 1–20 (2015)
Beck, A., Teboulle, M.: Gradient-based algorithms with applications to signal recovery problems. In: Palomar, D., Eldar, Y. (eds.) Convex Optimization in Signal Processing and Communications, pp. 42–88. Cambridge University Press, Cambridge (2010)
Bello-Cruz, J.Y., Li, G., Nghia, T. T.A.: On the Q-linear convergence of forward-backward splitting method. Part I: Convergence analysis. J. Optim. Theory Appl. 188, 378–401 (2021)
Bello Cruz, J.Y., Nghia, T.T.A.: On the convergence of the proximal forward-backward splitting method with linesearches. Optim. Method Softw. 31, 1209–1238 (2016)
Bolte, J., Nguyen, T.P., Peypouquet, J., Suter, B.W.: From error bounds to the complexity of first-order descent methods for convex functions. Math. Program. 165, 471–507 (2017)
Borwein, J., Lewis, A.S.: Convex analysis and nonlinear optimization: Theory and Examples. Springer Science & Business Media (2010)
Bredies, K., Lorenz, D.A.: Linear convergence of iterative soft-thresholding. J. Fourier Anal. Appl. 14, 813–837 (2008)
Chandrasekaran, V., Recht, B., Parrilo, P.A., Willsky, A.S.: The convex geometry of linear inverse problems. Found Comput Math 12, 805–849 (2012)
Combettes, P.L., Pesquet, J.-C.: Proximal splitting methods in signal processing. In: Fixed-Point Algorithms for Inverse Problems in Science and Engineering. Springer Optimization and Its Applications 49, pp. 185–212. Springer, New York (2011)
Combettes, P.L., Wajs, V.R.: Signal recovery by proximal forward-backward splitting. Multiscale Model. Simul. 4, 1168–1200 (2005)
Csiszár, I.: Why least squares and maximum entropy? An axiomatic approach to inference for linear inverse problems. Ann. Statist. 19, 2032–2066 (1991)
Cui, Y., Ding, C., Zhao, X.: Quadratic growth conditions for convex matrix optimization problems associated with spectral functions. SIAM Journal on Optimization 27(4), 2332–2355 (2017)
Daubechies, I., Defrise, M., De Mol, D.: An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Comm. Pure Appl. Math. 57, 1413–1457 (2004)
Davis, D., Yin, W.: Convergence rate analysis of several splitting schemes. Splitting Methods in Communications, Image Science, and Engineering. Scientific Computation, Springer, Cham, 2016
Dontchev, A.L., Rockafellar, R.T.: Implicit functions and solution mappings. A View from Variational Analysis, Springer, Dordrecht (2009)
Drusvyatskiy, D., Lewis, A.: Error bounds, quadratic growth, and linear convergence of proximal methods. Math. Oper. Res. 43, 693–1050 (2018)
Drusvyatskiy, D., Mordukhovich, B.S., Nghia, T.T.A.: Second-order growth, tilt stability, and metric regularity of the subdifferential. J. Convex Anal. 21, 1165–1192 (2014)
Fuchs, J.-J.: On sparse representations in arbitrary redundant bases. IEEE Trans. Inform. Theory. 50, 1341–1344 (2004)
Grasmair, M., Haltmeier, M., Scherzer, O.: Necessary and sufficient conditions for linear convergence of \(\ell _1\) regularization. Comm. Pure Applied Math. 64, 161–182 (2011)
Garrigos, G., Rosasco, L., Villa, S.: Convergence of the forward-backward algorithm: beyond the worst case with the help of geometry, arXiv:1703.09477 (2017)
Garrigos, G., Rosasco, L., Villa, S.: Thresholding gradient methods in Hilbert spaces: support identification and linear convergence, ESAIM: COCV 26 (2020), https://doi.org/10.1051/cocv/2019011
Gilbert, J.C.: On the solution uniqueness characterization in the \(\ell _1\) norm and polyhedral gauge recovery. J. Optim. Theory Appl. 172, 70–101 (2017)
Hale, E.T., Yin, W., Zhang, Y.: Fixed-point continuation for \(\ell _1\)-minimization: methodology and convergence. SIAM J. Optim. 19, 1107–1130 (2008)
Lewis, A.S.: Active sets, nonsmoothness, and sensitivity. SIAM J. Optim. 23, 702–725 (2002)
Lewis, A.S., Zhang, S.: Partial smoothness, tilt stability, and generalized Hessians. SIAM J. Optim. 23, 74–94 (2013)
Li, G., Pong, T.K.: Calculus of the exponent of Kurdyka-Łojasiewicz inequality and its applications to linear convergence of first-order methods. Found. Comp. Math. 18, 1199–1232 (2018)
Liang, J., Fadili, J., Peyré, G.: Local linear convergence of forward-backward under partial smoothness. Adv. Neural Inf. Process. Syst. (2014)
Liang, J., Fadili, J., Peyré, G.: Activity identification and local linear convergence of forward-backward type methods. SIAM J. Optim. 27, 408–437 (2017)
Luo, Z.-Q., Tseng, P.: Error bounds and convergence analysis of feasible descent methods: a general approach. Ann. Oper. Res. 46, 157–178 (1993)
Mordukhovich, B.S.: Variational Analysis and Generalized Differentiation, I: Basic Theory, II: Applications. Springer, Berlin (2006)
Mousavi, S., Shen, J.: Solution uniqueness of convex piecewise affine functions based optimization with applications to constrained \(\ell _1\)-minimization, ESAIM: Control Optim. Cal. Variations, 25 (2019), https://doi.org/10.1051/cocv/2018061
Parikh, N., Boyd, S.: Proximal algorithms. Found. Trends Optim. 1, 127–239 (2014)
Necoara, I., Nesterov, Yu., Glineur, F.: Linear convergence of first order methods for non-strongly convex optimization. Math. Program. 175, 69–107 (2019)
Rockafellar, R.T., Wets, R.J.-B.: Variational analysis. Springer, Berlin (1998)
Salzo, S.: The variable metric forward-backward splitting algorithm under mild differentiability assumptions. SIAM J. Optim. 27, 2153–2181 (2017)
Tao, S., Boley, D., Zhang, S.: Local linear convergence of ISTA and FISTA on the Lasso problem. SIAM J. Optim. 26, 313–336 (2016)
Tibshirani, R.: Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. 58, 267–288 (1996)
Tibshirani, R.J.: The Lasso problem and uniqueness. Electron. J. Stat. 7, 1456–1490 (2013)
Tropp, J.: Just relax: Convex programming methods for identifying sparse signals in noise. IEEE Trans. Inform. Theory. 52, 1030–1051 (2006)
Tseng, P.: A modified forward-backward splitting method for maximal monotone mappings. SIAM J. Control Optim. 38, 431–446 (2000)
Tseng, P., Yun, S.: A coordinate gradient descent method for nonsmooth separable minimization. Math. Program. 117, 387–423 (2009)
Vardi, Y., Shepp, L.A., Kaufman, L.: A statistical model for positron emission tomography. J. Amer. Statist. Assoc. 80, 8–37 (1985)
Wainwright, M.J.: Sharp thresholds for high-dimensional and noisy sparsity recovery using \(\ell _1\)-constrained quadratic programming (lasso). IEEE Trans. Inform. Theory. 55, 2183–2202 (2009)
Yu, P., Li, G., Pong, T.K.: Kurdyka-Łojasiewicz exponent via inf-projection, to appear in Found. Comput. Math. (2021). https://doi.org/10.1007/s10208-021-09528-6
Zhang, H., Yan, M., Yin, W.: One condition for solution uniqueness and robustness of both \(\ell _1\)-synthesis and \(\ell _1\)-analysis minimizations. Adv. Comput. Math. 42, 1381–1399 (2016)
Zhang, H., Yin, W., Cheng, L.: Necessary and sufficient conditions of solution uniqueness in 1-norm minimization. J. Optim. Theory Appl. 164, 109–122 (2015)
Zhang, L., Zhang, N., Xiao, X.: On the second-order directional derivatives of singular values of matrices and symmetric matrix-valued functions. Set-Valued Var. Anal. 21(3), 557–586 (2013)
Zhou, Z., So, A.M.-C.: A unified approach to error bounds for structured convex optimization. Math. Program. 165, 689–728 (2017)
Acknowledgements
The authors are indebted to both anonymous referees for their careful readings and thoughtful suggestions that allowed us to improve the original presentation significantly.
Funding
Open Access funding enabled and organized by CAUL and its Member Institutions.
Communicated by Liqun Qi.
This work was partially supported by the National Science Foundation (NSF) Grant DMS - 1816386 and 1816449, and a Discovery Project from Australian Research Council (ARC), DP190100555.
Bello-Cruz, Y., Li, G. & Nghia, T.T.A. Quadratic Growth Conditions and Uniqueness of Optimal Solution to Lasso. J Optim Theory Appl 194, 167–190 (2022). https://doi.org/10.1007/s10957-022-02013-2
Keywords
- Nonsmooth and convex optimization problems
- Forward–backward splitting method
- Linear convergence
- Uniqueness
- Lasso
- Quadratic growth condition
- Variational analysis