Abstract
We study the convergence rate of the proximal-gradient homotopy algorithm applied to norm-regularized linear least squares problems, for a general class of norms. The homotopy algorithm reduces the regularization parameter in a series of steps, and uses a proximal-gradient algorithm to solve the problem at each step. The proximal-gradient algorithm has a linear rate of convergence when the objective function is strongly convex and the gradient of the smooth component of the objective function is Lipschitz continuous. In many applications, the objective function in this type of problem is not strongly convex, especially when the problem is high-dimensional and regularizers are chosen that induce sparsity or low-dimensionality. We show that if the linear sampling matrix satisfies certain assumptions and the regularizing norm is decomposable, the proximal-gradient homotopy algorithm converges with a linear rate even though the objective function is not strongly convex. Our result generalizes results on the linear convergence of the homotopy algorithm for \(\ell _1\)-regularized least squares problems. Numerical experiments are presented that support the theoretical convergence rate analysis.
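To make the scheme concrete, the following is a minimal sketch of the homotopy strategy for the \(\ell _1\)-regularized case, where the proximal operator is soft-thresholding. The geometric decrease factor `eta`, the stage tolerance `tol_factor`, and the function names are illustrative choices, not the parameter schedule analyzed in the paper.

```python
import numpy as np

def soft_threshold(v, t):
    # proximal operator of t * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_grad(A, b, lam, x, L, tol, max_iter=1000):
    # proximal-gradient iterations for 0.5 ||Ax - b||^2 + lam ||x||_1
    for _ in range(max_iter):
        x_new = soft_threshold(x - A.T @ (A @ x - b) / L, lam / L)
        if np.linalg.norm(x_new - x) <= tol:
            x = x_new
            break
        x = x_new
    return x

def homotopy(A, b, lam_tgt, eta=0.7, tol_factor=0.2):
    # decrease lambda geometrically toward lam_tgt, warm-starting each stage
    L = np.linalg.norm(A, 2) ** 2      # Lipschitz constant of the smooth part
    lam = np.max(np.abs(A.T @ b))      # smallest lambda for which x = 0 is optimal
    x = np.zeros(A.shape[1])
    while lam > lam_tgt:
        lam = max(eta * lam, lam_tgt)
        x = prox_grad(A, b, lam, x, L, tol_factor * lam / L)
    return x
```

Each stage solves the problem only approximately and warm-starts the next, smaller regularization parameter; this is the mechanism whose overall linear rate the paper analyzes under the stated assumptions.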
Notes
For general psd B, the examples are \(A_i = B^{-\frac{1}{2}} A_i' \) with \(A_{i}' \sim \mathcal {N}(0,I_n)\) or \(A_{i,j}'\) Rademacher for all j.
References
Agarwal, A., Negahban, S., Wainwright, M.J.: Fast global convergence rates of gradient methods for high-dimensional statistical recovery. In: NIPS, vol. 23, pp. 37–45 (2010)
Bunea, F., Tsybakov, A., Wegkamp, M., et al.: Sparsity oracle inequalities for the lasso. Electron. J. Stat. 1, 169–194 (2007)
Cai, J.F., Candès, E.J., Shen, Z.: A singular value thresholding algorithm for matrix completion. SIAM J. Optim. 20(4), 1956–1982 (2010)
Candes, E., Plan, Y.: Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements. IEEE Trans. Inf. Theory 57(4), 2342–2359 (2011)
Candès, E., Recht, B.: Simple bounds for recovering low-complexity models. Math. Program. 141(1–2), 577–589 (2013)
Candes, E., Tao, T.: Near-optimal signal recovery from random projections: Universal encoding strategies? IEEE Trans. Inf. Theory 52(12), 5406–5425 (2006)
Candes, E., Tao, T.: The Dantzig selector: statistical estimation when \(p\) is much larger than \(n\). Ann. Stat. 35(6), 2313–2351 (2007)
Candes, E.J., Romberg, J.K., Tao, T.: Stable signal recovery from incomplete and inaccurate measurements. Commun. Pure Appl. Math. 59(8), 1207–1223 (2006)
Chandrasekaran, V., Recht, B., Parrilo, P.A., Willsky, A.S.: The convex geometry of linear inverse problems. Found. Comput. Math. 12(6), 805–849 (2012)
Donoho, D.L.: Compressed sensing. IEEE Trans. Inf. Theory 52(4), 1289–1306 (2006)
Gordon, Y.: Some inequalities for Gaussian processes and applications. Israel J. Math. 50(4), 265–289 (1985)
Gross, D.: Recovering low-rank matrices from few coefficients in any basis. IEEE Trans. Inf. Theory 57(3), 1548–1566 (2011)
Hale, E.T., Yin, W., Zhang, Y.: Fixed-point continuation for \(\ell _1\)-minimization: methodology and convergence. SIAM J. Optim. 19(3), 1107–1130 (2008)
Hou, K., Zhou, Z., So, A.M., Luo, Z.Q.: On the linear convergence of the proximal gradient method for trace norm regularization. In: Advances in Neural Information Processing Systems, pp. 710–718 (2013)
Jain, P., Meka, R., Dhillon, I.S.: Guaranteed rank minimization via singular value projection. NIPS 23, 937–945 (2010)
Jin, R., Yang, T., Zhu, S.: A new analysis of compressive sensing by stochastic proximal gradient descent. CoRR abs/1304.4680 (2013)
Ledoux, M., Talagrand, M.: Probability in Banach Spaces: Isoperimetry and Processes, vol. 23. Springer, New York (2013)
Liu, Z., Vandenberghe, L.: Interior-point method for nuclear norm approximation with application to system identification. SIAM J. Matrix Anal. Appl. 31(3), 1235–1256 (2009)
Lounici, K., Pontil, M., Van De Geer, S., Tsybakov, A.B., et al.: Oracle inequalities and optimal inference under group sparsity. Ann. Stat. 39(4), 2164–2204 (2011)
Lu, Z., Xiao, L.: On the complexity analysis of randomized block-coordinate descent methods. Math. Program. 152(1–2), 615–642 (2015)
Luo, Z.Q., Tseng, P.: On the linear convergence of descent methods for convex essentially smooth minimization. SIAM J. Control Optim. 30(2), 408–425 (1992)
Ma, S., Goldfarb, D., Chen, L.: Fixed point and Bregman iterative methods for matrix rank minimization. Math. Program. 128(1), 321–353 (2011)
Mazumder, R., Hastie, T., Tibshirani, R.: Spectral regularization algorithms for learning large incomplete matrices. J. Mach. Learn. Res. 11, 2287–2322 (2010)
Mendelson, S., Pajor, A., Tomczak-Jaegermann, N.: Reconstruction and subgaussian operators in asymptotic geometric analysis. Geom. Funct. Anal. 17(4), 1248–1282 (2007)
Moreau, J.J.: Fonctions convexes duales et points proximaux dans un espace hilbertien (French). CR Acad. Sci. Paris 255, 2897–2899 (1962)
Needell, D., Tropp, J.A.: CoSaMP: iterative signal recovery from incomplete and inaccurate samples. Appl. Comput. Harmon. Anal. 26(3), 301–321 (2009)
Negahban, S.N., Ravikumar, P., Wainwright, M.J., Yu, B.: A unified framework for high-dimensional analysis of \(m\)-estimators with decomposable regularizers. Stat. Sci. 27(4), 538–557 (2012)
Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22(2), 341–362 (2012)
Nesterov, Y.: Gradient methods for minimizing composite functions. Math. Program. 140(1), 125–161 (2013)
Nesterov, Y., Nemirovski, A.: On first-order algorithms for \(\ell _1\)/nuclear norm minimization. Acta Numer. 22, 509–575 (2013)
Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course, vol. 87. Springer, Amsterdam (2004)
Nguyen, N., Needell, D., Woolf, T.: Linear convergence of stochastic iterative greedy algorithms with sparse constraints. arXiv preprint arXiv:1407.0088 (2014)
Raskutti, G., Wainwright, M.J., Yu, B.: Minimax rates of estimation for high-dimensional linear regression over-balls. IEEE Trans. Inf. Theory 57(10), 6976–6994 (2011)
Recht, B., Fazel, M., Parrilo, P.: Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev. 52(3), 471–501 (2010)
Richtárik, P., Takáč, M.: Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Math. Program. 144(1–2), 1–38 (2014)
Rockafellar, R.T.: Monotone operators and the proximal point algorithm. SIAM J. Control Optim. 14(5), 877–898 (1976)
Rohde, A., Tsybakov, A.B., et al.: Estimation of high-dimensional low-rank matrices. Ann. Stat. 39(2), 887–930 (2011)
Shalev-Shwartz, S., Gonen, A., Shamir, O.: Large-scale convex minimization with a low-rank constraint. arXiv preprint arXiv:1106.1622 (2011)
Shalev-Shwartz, S., Srebro, N., Zhang, T.: Trading accuracy for sparsity in optimization problems with sparsity constraints. SIAM J. Optim. 20(6), 2807–2832 (2010)
Talagrand, M.: The Generic Chaining, vol. 154. Springer, Berlin (2005)
Toh, K.C., Yun, S.: An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems. Pac. J. Optim. 6(3), 615–640 (2010)
Van De Geer, S.A., Bühlmann, P., et al.: On the conditions used to prove oracle results for the lasso. Electron. J. Stat. 3, 1360–1392 (2009)
Wen, Z., Yin, W., Goldfarb, D., Zhang, Y.: A fast algorithm for sparse reconstruction based on shrinkage, subspace optimization, and continuation. SIAM J. Sci. Comput. 32(4), 1832–1857 (2010)
Wright, S.J., Nowak, R.D., Figueiredo, M.A.: Sparse reconstruction by separable approximation. IEEE Trans. Signal Process. 57(7), 2479–2493 (2009)
Xiao, L., Zhang, T.: A proximal-gradient homotopy method for the sparse least-squares problem. SIAM J. Optim. 23(2), 1062–1091 (2013)
Zhang, H., Jiang, J., Luo, Z.Q.: On the linear convergence of a proximal gradient method for a class of nonsmooth convex minimization problems. J. Oper. Res. Soc. China 1(2), 163–186 (2013)
Acknowledgments
The authors are greatly indebted to Dr. Lin Xiao from Microsoft Research, Redmond, for his many valuable comments and suggestions. We thank Amin Jalali for his comments and helpful discussions.
Additional information
This material is based upon work supported by the National Science Foundation under Grant No. ECCS-0847077, and in part by the Office of Naval Research under Grant No. N00014-12-1-1002.
Appendices
Appendix 1
In this section we give a lower bound on the number of measurements m that suffices for the existence of \(r>1\) in Assumption 1 with high probability when A is sampled from a certain class of distributions. To simplify the notation we assume that \(B = I\); therefore, \(\langle {x} , {y} \rangle = x^T y\). Given a random variable z, the sub-Gaussian norm of z is defined as the Orlicz norm \({\left\| z\right\| }_{\psi _2} = \inf \big \{t>0 \,:\, E\big [\psi _2(|z|/t)\big ] \le 1\big \}\),
where \(\psi _2(x) = e^{x^2}-1\). For an \(n\)-dimensional random vector \(w \sim P\), the sub-Gaussian norm is defined as \({\left\| w\right\| }_{\psi _2} = \sup _{u \in S^{n-1}} {\left\| \langle {w} , {u} \rangle \right\| }_{\psi _2}.\)
P is called isotropic if \(E \big [\langle {w} , {u} \rangle ^2\big ] = 1\) for all \(u \in S^{n-1}\). Two important examples of sub-Gaussian random variables are Gaussian and bounded random variables. Suppose \(A: \mathbb {R}^{n} \mapsto \mathbb {R}^{m}\) is given by:
where \(A_i\), \(1\le i \le m\), are iid samples from an isotropic sub-Gaussian distribution P on \(\mathbb {R}^n\). Two important examples are the standard Gaussian vector \(A_{i}\sim \mathcal {N}(0,I_n)\) and a random vector of independent Rademacher variables (see the Notes). We want to bound the following probabilities for \(\theta \in (0,1)\):
When \(A_{i} \sim \mathcal {N}(0,I_n)\) for all i, one can use the generalization of Slepian’s lemma due to Gordon [11], together with concentration inequalities for Lipschitz functions of Gaussian random variables, to derive (see, for example, [17, chapter 15]):
whenever,
Here, G is defined as:
where \(g \sim \mathcal {N}(0,I_n)\). For the sub-Gaussian case, we use a result by Mendelson et al. [24, Theorem 2.3]. Using Talagrand’s generic chaining theorem [40, Theorem 2.1.1], the authors derive a bound that, as in the Gaussian case, depends on G(k). Their result in our notation states:
Proposition 3
Suppose A is given by (35). If P is an isotropic distribution and \({\left\| A_1\right\| }_{\psi _2} \le \alpha \), then there exist constants \(c_1\) and \(c_2\) such that
with probability exceeding \(1 - \exp {(-c_2 \theta ^2 m / \alpha ^4)}\) whenever
Suppose \(\lambda _\mathrm{tgt} = 4 {\left\| A^* z\right\| }^*\), which sets \(\gamma = \frac{5+4\delta }{3-4\delta }\). We can state the following proposition based on Proposition 3:
Proposition 4
Let \(r>1\), \({\tilde{k}} = 36 r c k_{0} (1+\gamma )\gamma _\mathrm{inc}\) and \({\bar{k}} = c k_0(1+\gamma )^2\). If \(m \ge \frac{{c_1} \alpha ^4 }{(r-1)^2} (G(2{\tilde{k}})^2 + r^2 G({\bar{k}})^2)\), then r satisfies Assumption 1 with probability exceeding \(1 - \exp (-c_2 (r-1)^2 m / (r^2\alpha ^4))\).
The proof is a simple adaptation of the proof of Theorem 1.4 in [24], which we omit here. To compare this with the number of measurements sufficient for successful recovery within a given accuracy, combining (59) in the proof of Lemma 1 with Proposition 3 yields:
Proposition 5
Let \(r>1\), \({\bar{k}} = c k_0(1+\gamma )^2\) and \(x^* \in {{\mathrm{\arg \min }}}\phi _{\lambda }(x)\). If \(m \ge \frac{{c_1} \alpha ^4 r^2}{(r-1)^2} G({\bar{k}})^2\), then \({\left\| x^* - x_0\right\| }_2 \le {c_2 r\lambda \sqrt{ck_{0}} }\) with probability exceeding \(1 - \exp (-c_2 (r-1)^2 m / (r^2\alpha ^4))\).
Note that in the case of the \(\ell _1\), \(\ell _{1,2}\), and nuclear norms, this bound on m matches, up to order, the lower bounds given by the minimax rates in [19, 33] and [37].
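As an illustration of the sampling ensembles considered in this appendix, the sketch below draws iid isotropic sub-Gaussian rows for the two examples above (standard Gaussian and Rademacher) and numerically checks the isotropy condition \(E \big [\langle {w} , {u} \rangle ^2\big ] = 1\). The function name and sample sizes are our own illustrative choices.

```python
import numpy as np

def sample_rows(m, n, kind="gaussian", rng=None):
    # draw m iid isotropic sub-Gaussian rows A_i in R^n
    rng = np.random.default_rng() if rng is None else rng
    if kind == "gaussian":
        return rng.standard_normal((m, n))            # A_i ~ N(0, I_n)
    if kind == "rademacher":
        return rng.choice([-1.0, 1.0], size=(m, n))   # independent +/-1 entries
    raise ValueError(f"unknown ensemble: {kind}")

# isotropy: for any fixed unit vector u, E[<A_i, u>^2] = 1, so the
# empirical mean over many rows should be close to 1 for both ensembles
rng = np.random.default_rng(1)
u = rng.standard_normal(8)
u /= np.linalg.norm(u)
for kind in ("gaussian", "rademacher"):
    A = sample_rows(100000, 8, kind, rng)
    print(kind, np.mean((A @ u) ** 2))
```

For general psd B, rows for the footnoted example can be obtained by applying \(B^{-1/2}\) to each sampled row.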
Appendix 2
1.1 Proof of Theorem 1
Sufficiency. First consider the case where \(k = 1\) and \(x = \gamma _1 a_1\) with \(\gamma _1 > 0\). Note that \(a_1 \in \partial {\left\| x\right\| }=\partial {\left\| a_1\right\| }\) because \({\left\| a_1\right\| }^*= 1\) for all \(a_1 \in \mathcal {G}_{{\left\| \cdot \right\| }}\) and \(\langle {a_1} , {x} \rangle = \gamma _1 = {\left\| x\right\| }\). Define:
Note that C is a convex set that contains the origin. Moreover, C is orthogonal to \(a_1\). We claim that (9) is satisfied with \(T_{a_1}^\bot = {{\mathrm{span}}}{C}\). To establish the claim, we first prove that C is symmetric and is contained in the dual norm ball. Let \(v\in C\) and \(\xi = a_1 + v \in \partial {{\left\| a_1\right\| }}\). By (4), \(\langle {a_1} , {\xi } \rangle = {\left\| \xi \right\| }^* = 1\). Therefore,
and we can apply the hypothesis of the theorem (in particular statement I) to obtain an orthonormal representation for \(\xi \):
Now by statement II in the hypothesis we get:
Let \(\xi ' = a_1-\sum _{i=1}^{l}{\eta _i b_i}\). By the hypothesis, \({\left\| \xi '\right\| }^* = \max \{1,\max _{i}{\eta _i}\} = 1\). Also, \(\langle {\xi '} , {a_1} \rangle =1\) hence \(\xi ' \in \partial {\left\| a_1\right\| }\) and \(-v \in C\).
Let \(v \in {{\mathrm{span}}}{C}\) with \({\left\| v\right\| }^*\le 1\). Since C is a symmetric convex set, there exists \(\lambda \in (0,1]\) such that \(\lambda v \in C\) (i.e., C is absorbing in \({{\mathrm{span}}}{C}\)). Define \(z = a_1+\lambda v\) which is in \(\partial {{\left\| a_1\right\| }}\). Since \(\langle {a_1} , {z} \rangle = {\left\| z\right\| }^* = 1\), we can write z as
where \(\{c_i|i=1,\ldots , k'\} \subset \mathcal {G}_{{\left\| \cdot \right\| }}\) and \(\{\nu _i \ge 0| i=1, \ldots , k'\}\) satisfy the hypothesis of the theorem. In particular, since \(v = \frac{1}{\lambda }\sum _{i=1}^{k'}{\nu _i c_i}\), we have \(\max _i{\nu _i/\lambda } \le 1\). Hence \({\left\| a_1 + v\right\| }^* = \max \{1,\nu _1/\lambda , \ldots , \nu _{k'}/\lambda \} = 1\) and \(a_1+v \in \partial {{\left\| a_1\right\| }}\). Therefore,
Now suppose that \(x = \sum _{i=1}^{k}{\gamma _i a_i}\) with \(k > 1\). Note that \({\sum _{i=1}^{k}{a_i}} \in \partial {\left\| x\right\| }\) since \({\left\| \sum _{i=1}^{k}{a_i}\right\| }^* = 1\) and \(\langle {\sum _{i=1}^{k}{a_i}} , {x} \rangle = \sum _{i=1}^{k}\gamma _i = {\left\| x\right\| }\). Let \(\xi \in \partial {{\left\| x\right\| }}\) and define \(v = \xi - \sum _{i=1}^{k}{a_i}\). We can write:
Also, since \(\sum _{i=1}^{k}{a_i} \in \partial {\left\| a_i\right\| }\), (40) results in:
Since \(\xi = \sum _{i=1}^{k}{a_i} + v \in \partial {\left\| a_1\right\| }\), we have \({\left\| \sum _{i=2}^{k}{a_i} + v\right\| }^* = 1\) hence \(\sum _{i=2}^{k}{a_i} + v \in \partial {\left\| a_2\right\| }\). By induction, we conclude that \(a_k + v \in \partial {\left\| a_k\right\| }\). This implies \({\left\| v\right\| }^*\le 1\).
Let \(v' \in \cap _{i \in \{1,2,\ldots ,k\}}{T_{a_{i}}^{\bot }}\) with \({\left\| v'\right\| }^*\le 1\) and define \(\xi ' = \sum _{i=1}^{k}{a_i}+ v'\). We will prove that \({\left\| \xi '\right\| }^* \le 1\) and hence \(\xi ' \in \partial {{\left\| x\right\| }}\). To prove this we use induction. Define
Note that \({\left\| z_1\right\| }^* \le 1\) since \(z_1 = a_k+v' \in \partial {\left\| a_k\right\| }\). Suppose \({\left\| z_{l'}\right\| }^* \le 1\) for some \(l' < k \). We prove that \({\left\| z_{l'+1}\right\| }^* \le 1\). We have \(\sum _{i=k-l'+1}^{k}{a_i} \in T_{a_{k-l'}}^{\bot }\) because \(\sum _{i=k-l'}^{k}{a_i} = a_{k-l'} + \sum _{i=k-l'+1}^{k}{a_i} \in \partial {\left\| a_{k-l'}\right\| }\). Combining this with the fact that \(v' \in T_{a_{k-l'}}^{\bot }\), we get \(z_{l'} \in T_{a_{k-l'}}^{\bot }\). Therefore, \(z_{l'+1}= a_{k-l'}+z_{l'} \in \partial {\left\| a_{k-l'}\right\| }\) hence \({\left\| z_{l'+1}\right\| }^*\le 1\). Thus \({\left\| \xi '\right\| }^* = {\left\| z_k\right\| }^*\le 1\). We conclude that:
Necessity. For any \(a\in \mathcal {G}_{{\left\| \cdot \right\| }}\), we have:
That implies \({\left\| a\right\| }^* = 1\) and \(a \in \partial {\left\| a\right\| }\). Since \(a\in T_a\), we conclude that:
Take \(\gamma _1 = \langle {a_1} , {x} \rangle ={\left\| x\right\| }^*\) and let \({\varDelta }_1 = x - {\gamma _1} a_1\). If \({\varDelta }_1 = 0\), then take \(k =1\) and \(x = \gamma _1 a_1\). Suppose \({\varDelta }_1 \ne 0\). Since \({\left\| \frac{1}{\gamma _1}x\right\| }^* = 1\) and \(\langle {a_1} , {{\frac{1}{\gamma _1}x}} \rangle = {\left\| a_1\right\| } = 1\), we can conclude that \(\frac{1}{\gamma _1}x \in \partial {\left\| a_1\right\| }\). Furthermore, we have
Now we introduce a lemma that will be used in the rest of the proof.
Lemma 4
Suppose \(a \in \mathcal {G}_{{\left\| \cdot \right\| }}\) and \(y \in T^{\perp }_{a} - \{0\}\). If \(z \in \mathcal {B}_{{\left\| \cdot \right\| }}\) is such that \({\left\| y\right\| }^* = \langle {y} , {z} \rangle \), then \(z \in T^{\perp }_{a}\).
Proof
Without loss of generality assume that \({\left\| y\right\| }^* = 1\). It suffices to show that if \(b \in \mathcal {G}_{{\left\| \cdot \right\| }}\) and \( \langle {y} , {b} \rangle = 1\), then \(b \in T^{\perp }_{a}\). Consider such \(b \in \mathcal {G}_{{\left\| \cdot \right\| }}\). By (43), \({\left\| a+y\right\| }^* = 1\). That results in:
By considering \(-y\) and \(-b\) we get that \(\langle {a} , {b} \rangle = 0\). Since \(\langle {a+y} , {b} \rangle = {\left\| b\right\| } = 1\), we can conclude that \(a+y \in \partial {\left\| b\right\| }\). Since \( \langle {y} , {b} \rangle = 1\) and \({\left\| y\right\| }^* = 1\), \(y \in \partial {\left\| b\right\| }\). Combining these two conclusions, we get:
\(\square \)
Suppose that there exist \(l \in \{1,2,\ldots ,k\}\), an orthogonal set \(\{a_i \in \mathcal {G}_{{\left\| \cdot \right\| }}|\, i=1,2,\ldots , l\}\), and a set of coefficients \(\{\gamma _i \ge 0 |\, i = 1,2,\ldots ,l\}\) such that \(x = \sum _{i=1}^{l}{\gamma _i a_i} + {\varDelta }_l\), \({\varDelta }_l \in \cap _{i=1}^{l}{T_{a_i}^{\bot }} \), and:
By Lemma 4, there exists \(a_{l+1} \in \mathcal {G}_{{\left\| \cdot \right\| }}\) such that \(a_{l+1} \in \cap _{i=1}^{l}{T_{a_i}^{\bot }}\) and \(\langle {a_{l+1}} , {{\varDelta }_l} \rangle = {\left\| {\varDelta }_l\right\| }^*\). Take \(\gamma _{l+1} = \langle {a_{l+1}} , {{\varDelta }_l} \rangle = {\left\| {\varDelta }_l\right\| }^*\) and let \({\varDelta }_{l+1} = {\varDelta }_l - \gamma _{l+1} a_{l+1}\). We have \({\varDelta }_{l+1} \in \bigcap _{i=1}^{l}{T_{a_i}^{\bot }}\) because \(\{{\varDelta }_l, a_{l+1}\} \subset \bigcap _{i=1}^{l}{T_{a_i}^{\bot }}\). Since \({\left\| \frac{1}{\gamma _{l+1}}{\varDelta }_{l}\right\| }^* = 1\) and \(\langle {a_{l+1}} , {{\frac{1}{\gamma _{l+1}}{\varDelta }_{l}}} \rangle = {\left\| a_{l+1}\right\| } = 1\), we can conclude that \(\frac{1}{\gamma _{l+1}}{\varDelta }_l \in \partial {\left\| a_{l+1}\right\| }\). Using the same reasoning as in (44), we have \({\varDelta }_{l+1} \in T_{a_{l+1}}^{\bot }\) hence \( {\varDelta }_{l+1} \in \bigcap _{i=1}^{l+1}{T_{a_i}^{\bot }}\).
By the decomposability assumption there exist \(e\in \mathbb {R}^n\) and a subspace T such that:
We claim that
To prove the first claim, it is enough to show that \(\sum _{i=1}^{l+1}{a_i} \in \partial {\left\| \sum _{i=1}^{l+1}{a_i}\right\| }\). Note that \( {\left\| \sum _{i=1}^{l+1}{a_i} \right\| }^*\le 1\) since \(\sum _{i=1}^{l+1}{a_i}= \sum _{i=1}^{l}{a_i}+a_{l+1} \in \partial {\left\| \sum _{i=1}^{l}{a_i}\right\| }\) which is given by (45). Now we can write:
On the other hand, by triangle inequality,
thus
Therefore, \(\sum _{i=1}^{l+1}{a_i} \in \partial {{\left\| \sum _{i=1}^{l+1}{a_i}\right\| }}\). Since \(\sum _{i=1}^{l+1}{a_i}\in T_{\sum _{i=1}^{l+1}{a_i}} = T\), we conclude that:
To prove (48), we first show that \(\bigcap _{i=1}^{l+1}{T_{a_i}^{\bot }} \subseteq T^{\bot }\). Let \(\xi = e+v\) with \(v \in \bigcap _{i=1}^{l+1}{T_{a_i}^{\bot }}\). Note that \({\left\| a_{l+1}+v\right\| }^*\le 1\) since \(a_{l+1}+v \in \partial {{\left\| a_{l+1}\right\| }}\). Furthermore, \(a_{l+1}+v \in \bigcap _{i=1}^{l}{T_{a_i}^{\bot }}\), which in turn implies \({\sum _{i=1}^{l+1}{a_i}+v} \in \partial {{\left\| \sum _{i=1}^{l}{a_i}\right\| }}\) hence \({\left\| \sum _{i=1}^{l+1}{a_i}+v\right\| }^* \le 1\). Additionally, we have:
Hence \(\xi \in \partial {\left\| \sum _{i=1}^{l+1}{a_i}\right\| }\) and \(v \in T^{\bot }\).
Now, let \(\xi ' = \sum _{i=1}^{l+1}{a_i}+v' \in \partial {\left\| \sum _{i=1}^{l+1}{a_i}\right\| }\). Note that:
moreover, \(\sum _{i=1}^{l}{a_i} \in T_{a_{l+1}}^{\bot }\) since \(\sum _{i=1}^{l+1}{a_i} \in \partial {\left\| a_{l+1}\right\| }\). This implies \(v' \in \bigcap _{i=1}^{l+1}{T_{a_i}^{\bot }}\), which completes the proof of (48).
Because \(a_i \notin T_{a_i}^{\bot }\) for all \(i \in \{1,2,\ldots ,l+1\}\), we have \(\dim \big (\cap _{i = 1}^{l+1}T_{a_i}^{\bot }\big )\le n-l-1\). Hence there exist \(k \le n\), an orthogonal set \(\{a_i \in \mathcal {G}_{{\left\| \cdot \right\| }},\, i = 1, 2, \ldots , k\}\), and a set of coefficients \(\{\gamma _i \ge 0,\; i\in \{1,2,\ldots ,k\}\}\) such that \(x = \sum _{i=1}^{k}{\gamma _i a_i}\) and:
That proves \({\left\| x\right\| } = \langle {\sum _{i=1}^{k}{a_i}} , {x} \rangle = \sum _{i=1}^{k}{\gamma _i}.\)
To prove statement II, we first prove that \(a_i \in T_{a_j}^{\bot }\) for all \(i,j \in \{1,2,\ldots ,k\}\). By (49), \({\left\| \sum _{i=1}^{k}{a_i}\right\| }^*\le 1\). We can write:
Now the claim follows from Lemma 4.
Let \(l=| \{i\,|\,\eta _i \ne 0\}|\). If \(l = 0\), the statement is trivially true. Suppose the statement is true when \(l=l'-1\) for some \(l' \in \{1,\ldots ,n\}\) and consider the case where \(l=l'\). Suppose that \(|\eta _j| = \max _{i}{|\eta _i|}\). After proper normalization we may assume that \(\eta _j = 1\). Let \(y = \sum _{i\ne j}{\eta _ia_{i}}\). We can deduce the following properties for y:
By the decomposability assumption, \(\sum _{i=1}^{k}{\eta _ia_{i}}= a_j + y \in \partial {\left\| a_j\right\| }\), hence \({\left\| \sum _{i =1}^{k}{\eta _ia_{i}}\right\| }^*\le 1\); therefore \({\left\| \sum _{i=1}^{k}{\eta _ia_{i}}\right\| }^* = 1\). \(\square \)
Remark 1
Let \(x = \sum _{i=1}^{K(x)}\gamma _i a_i\). Since \({T^{\bot }_{x}} = \bigcap _{i=1}^{K(x)}{T_{a_i}^{\bot }}\), a more general version of Lemma 4 holds:
Lemma 5
Suppose \(x \in \mathbb {R}^n\) and \(y \in T^{\perp }_{x} - \{0\}\). If \(z \in \mathcal {B}_{{\left\| \cdot \right\| }}\) is such that \({\left\| y\right\| }^* = \langle {y} , {z} \rangle \), then \(z \in T^{\perp }_{x}\).
We state and prove a dual version of Lemma 5, which will be used in the proofs of Lemmas 1 and 2.
Lemma 6
Let \(x \in \mathbb {R}^n\). If \(y \in T^{\perp }_{x}\), then there exists \(z \in T_{x}^{\bot } \cap \mathcal {B}_{{\left\| \cdot \right\| }^*}\) such that \({\left\| y\right\| } = \langle {y} , {z} \rangle \).
Proof
If \(y = 0\), then the lemma is trivially true. If \(y \ne 0\), then:
Therefore, by Lemma 5, we get
\(\square \)
1.2 Proof of Theorem 2
First, we introduce a lemma.
Lemma 7
Let \(\{a_1,\ldots , a_k\}\) be an orthogonal subset of \(\mathcal {G}_{{\left\| \cdot \right\| }}\) that satisfies II in Theorem 1. Let \(y = \sum _{i=1}^{k}\beta _i a_i\), with \(\beta _i \in \mathbb {R}\) for all i, then
Proof
Let \(k' = |\{i\;|\; \beta _i \ne 0\}|\). Without loss of generality assume that \(\beta _i \ne 0\) for \(i \le k'\) and \(\beta _i = 0\) for \(i > k'\). Let \(\eta _i = \mathrm{sgn}(\beta _i)\) and \(a'_i = \mathrm{sgn}(\beta _i) a_i\) for all \(i \le k'\). Since \(a_1, \ldots , a_k\) satisfy Condition II in the orthogonal representation theorem, so do \(a'_1,\ldots , a'_{k'}\).
Now we show that y and \(a'_1,\ldots , a'_{k'}\) satisfy Condition I. By (11), \({\left\| \sum _{i=1}^{k'} a'_i\right\| }^* \le 1\). Therefore,
Therefore, by the orthogonal representation theorem, \(e_y = \sum _{i=1}^{k'} a'_i\). Thus \(K(y) = {\left\| e_y\right\| }_2^2 = k'\). \(\square \)
For any \(x \in \mathbb {R}^n - \{0\}\) define
Define \(l(0) = 0\). Now the proof is a simple consequence of the following lemma:
Lemma 8
For all \(x \in \mathbb {R}^n\), \(l(x) = K(x)\).
Proof
\(K(x) \ge l(x)\) by the definition of l(x). We prove that \(K(x) = l(x)\) by induction on K(x). When \(K(x) \in \{0,1\}\), the statement is trivially true. Suppose the statement is true when \(K(x) \in \{0,1,2,\ldots ,k-1\}\). Consider the case where \(K(x) = k\). By way of contradiction, suppose \(l(x) < K(x)\). Let
where \(\gamma _1, \ldots , \gamma _k\) and \(a_1,\ldots , a_k\) are given by the orthogonal representation theorem. If \(l(x) = 1\), then:
for some \(\alpha _1 \ne 0\) and \(b_1 \in \mathcal {G}_{{\left\| \cdot \right\| }}\). Since \(|\alpha _1|= {\left\| \alpha _1 b_1\right\| } = {\left\| x\right\| } = \sum _{i=1}^{k} \gamma _i \), either \(b_1\) or \(-b_1\) can be written as a convex combination of \(a_1, \ldots , a_k\), which contradicts the fact that \(b_1\in \mathcal {G}_{{\left\| \cdot \right\| }}\).
If \(l(x) = l > 1\), we can write x as:
with \(\{b_1, \ldots , b_l\} \subseteq \mathcal {G}_{{\left\| \cdot \right\| }}\). By replacing \(b_i\) with \(-b_i\) where needed, we may assume without loss of generality that \(\alpha _i > 0\) for all i. Let \(u = 2 \alpha _1 b_1 \) and \(v = 2 \sum _{i=2}^{l} \alpha _i b_i\), and note that \(x = (u+v)/2\). Let \(C = \mathrm{Cone}{\{a_1, a_2,\ldots , a_k\}}\), and let \(\mathrm{int}\, C\) and \(\mathrm{bd}\, C\) denote the interior and the boundary of C, respectively. Note that \(u \notin \mathrm{int}\, C\): by Lemma 7, \(u \in \mathrm{int}\, C\) would imply \(K(u) = k\), whereas \(l(u) = 1\). Now we consider two cases for v.
-
Case 1.
If \(v \in \mathrm{int C}\), then we can write v as a conic combination of \(a_1, a_2, \ldots , a_k\) with positive coefficients:
$$\begin{aligned} v = 2 \sum _{i=2}^{l} \alpha _i b_i = \sum _{i=1}^{k} c_i a_i, \end{aligned}$$where \(c_i > 0\) for all i.
-
Case 2.
If \(v \notin \mathrm{int}\, C\), let \(L = \{\theta u + (1-\theta ) v \; | \; \theta \in [0,1]\}\). Since L intersects the interior of C at x and \(u, v \notin \mathrm{int}\, C\), there exist \(u',v'\) such that \(L \cap \mathrm{bd}\, C = \{u',v'\}\). Suppose \(v'\) is on the line segment between v and x (see Fig. 5). Let \(L'= \{\theta u + (1-\theta ) v' \; | \; \theta \in [0,1]\}\) and note that \(x \in L'\). Since \(v' \in \mathrm{bd}\, C\), it can be written as a conic combination of at most \(k-1\) of \(a_1, \ldots , a_k\). Without loss of generality assume that \(v' = \sum _{i=2}^{k}{\beta _i a_i}\). For some \(\theta \in (0,1)\):
$$\begin{aligned} x = \theta u + (1-\theta ) v' = \alpha _1' b_1 + \sum _{i=2}^{k}{\beta _i' a_i}, \end{aligned}$$where \(\alpha _1' = 2 \theta \alpha _1\) and \(\beta '_i = (1 -\theta ) \beta _i\). Using the representation in (50), we get:
$$\begin{aligned} \alpha _1' b_1&= \gamma _1 a_1 + \sum _{i=2}^{k} (\gamma _i - \beta _i') a_i. \end{aligned}$$We have \(l(\alpha _1' b_1) = 1\), and by Lemma 7, \(K(\alpha _1' b_1) = 1 + |\{i | \gamma _i \ne \beta _i', i = 2, \ldots , k \}|\). Therefore, \(\gamma _i = \beta _i'\) for all \(i = 2,\ldots ,k\) and \(b_1 = a_1\). Combining the previous fact with (50) and (51), we get:
$$\begin{aligned} x - \alpha _1 a_1 = (\gamma _1 - \alpha _1) a_1 + \sum _{i=2}^{k} \gamma _i a_i = \sum _{i=2}^{l} \alpha _i b_i. \end{aligned}$$(52)If \(\gamma _{1} = \alpha _1\), then by the induction hypothesis \(k = l\), which is a contradiction. Now suppose \(\gamma _1 - \alpha _1 \ne 0\). In both cases we have produced a point y such that \(K(y) = k\) and \(l(y) \le l-1\). We can continue this procedure until we obtain a y such that \(K(y) = k\) and \(l(y) = 1\), which gives the contradiction. \(\square \)
1.3 Proof of Proposition 2
In iteration \(t+1\), when the backtracking procedure stops, the following inequality holds:
On the other hand, by (19), we have
which ensures \(M_{t+1} \le \gamma _\mathrm{inc} L_f\) since \(m_{L}(x^{(t)},x^{(t+1)})\) is non-decreasing in L. By (17), we have:
If we confine x to \(\{\alpha x^{*}+(1-\alpha )x^{(t)}\,|\, 0 \le \alpha \le 1\}\), inequality (53) combined with (54) results in
The right-hand side of the above inequality is minimized at \(\alpha ^* = \min \{1,\frac{\mu _f}{2 \gamma _\mathrm{inc} L_f}\}\). Therefore, we get
To prove (23), we note that the backtracking stopping criterion ensures
The hypothesis (18) ensures \(M_{t+1} \ge \mu _f\). Combining (16) and (55) and using the lower and upper bounds on \(M_{t+1}\), we get the desired result.
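The backtracking rule analyzed above, which increases \(M\) by the factor \(\gamma _\mathrm{inc}\) until the local quadratic model upper-bounds the smooth part \(f\) at the proximal step, can be sketched for the \(\ell _1\) case as follows. The function names are ours, and the stopping test is the standard sufficient-decrease condition; the returned \(M\) satisfies the bound \(M_{t+1} \le \gamma _\mathrm{inc} L_f\) used in the proof.

```python
import numpy as np

def soft_threshold(v, t):
    # proximal operator of t * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def backtracking_step(A, b, lam, x, M, gamma_inc=2.0):
    # one prox-gradient step on 0.5 ||Ax - b||^2 + lam ||x||_1, increasing M
    # until f(x+) <= f(x) + <grad f(x), x+ - x> + (M/2) ||x+ - x||^2
    r = A @ x - b
    f_x = 0.5 * r @ r
    grad = A.T @ r
    while True:
        x_plus = soft_threshold(x - grad / M, lam / M)
        d = x_plus - x
        f_plus = 0.5 * np.linalg.norm(A @ x_plus - b) ** 2
        if f_plus <= f_x + grad @ d + 0.5 * M * (d @ d):
            return x_plus, M
        M *= gamma_inc
```

Because the model majorizes \(f\) once \(M \ge L_f\), the accepted \(M\) never exceeds \(\gamma _\mathrm{inc} L_f\) (when started below it), and the composite objective is non-increasing along the accepted steps.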
1.4 Proof of Lemma 1
By the hypothesis there exists \(\xi \in \partial {\left\| x\right\| }\) such that \({\left\| A^*(A x-b)+\lambda \xi \right\| }^* \le \delta \lambda \). Therefore, we can write
Now we lower-bound \({\left\| x\right\| }\):
By Lemma 6, there exists \(s \in T_{x_{0}}^{\bot }\) such that \(\langle {s} , {P_{{{T}_{x_{0}}}^{\bot }}(x-x_{0})} \rangle = {\left\| P_{{{T}_{x_{0}}}^{\bot }}(x-x_{0})\right\| }\) and \({\left\| s\right\| }^* = 1\). Note that \(e_{x_0}+s \in \partial {{\left\| x_{0}\right\| }}\) hence \({\left\| e_{x_0}+s\right\| }^* \le 1\). Therefore, we get:
Combining (57) and (56), we get
By applying the triangle inequality to \({\left\| x-x_0\right\| }\), we obtain
That yields
Using the definition of the lower restricted isometry constant, we derive
which yields the following bounds
By convexity of \(\phi _{\lambda }\),
1.5 Proof of Lemma 2
Let \({\varDelta }=\frac{3 ck_{0} \lambda (1+\gamma )}{2 \rho _{-}(A,{ c(1+\gamma )^2 k_{0}})}\). We can write
If \({\left\| x-x_{0}\right\| } \le {\varDelta }\), half of the conclusion is immediate. To get the second half, we can expand the left-hand side of (61) to get:
Suppose \({\left\| x-x_{0}\right\| } > {\varDelta }\); then from (61) we get:
Using (57) and the triangle inequality, we get:
Using the same reasoning as in the proof of Lemma 1, we get the desired results.
1.6 Proof of Lemma 3
By the first-order optimality condition there exists \(\xi \in \partial {\left\| x^{+}\right\| }\) such that:
Note that \(\xi = e_{x^{+}} + v\) for some \(v \in T_{x^{+}}^{\bot }\). By Lemma 6, there exists \(v' \in T_{x^{+}}^{\bot } \cap \mathcal {B}_{{\left\| \cdot \right\| }^*}\) such that \(\langle {v'} , {v} \rangle = {\left\| v\right\| }\). Since \(e_{x^+} + v' \in \partial {\left\| x^{+}\right\| }\), \({\left\| e_{x^+} + v'\right\| }^* \le 1\). Therefore, we can write:
Let \(\xi = \sum _{i=1}^{l}\gamma _i a_i\), where \(a_1,\ldots ,a_{l}\) and \(\gamma _1, \ldots , \gamma _{l}\) are given by the orthogonal representation theorem. Since \(\gamma _i \le 1\) for all i, \(l \ge {\left\| \xi \right\| }\). If \({\left\| \xi \right\| } > {\tilde{k}}\), we can define \(u = \sum _{i=1}^{{\tilde{k}}}{a_i}\), then
Since \(\phi _{\lambda }(x^{+}) \le \phi _{\lambda }{(x)}\), by Lemma 2, we have:
Define
We can rewrite (62) as:
But this contradicts Assumption 1; hence \({\left\| \xi \right\| } \le {\tilde{k}}\) and therefore \(K(x^{+}) \le {\tilde{k}}\). \(\square \)
Eghbali, R., Fazel, M. Decomposable norm minimization with proximal-gradient homotopy algorithm. Comput Optim Appl 66, 345–381 (2017). https://doi.org/10.1007/s10589-016-9871-8