Abstract
We introduce novel relaxations for cardinality-constrained learning problems, including least-squares regression as a special but important case. Our approach is based on reformulating a cardinality-constrained problem exactly as a Boolean program, to which standard convex relaxations such as the Lasserre and Sherali-Adams hierarchies can be applied. We analyze the first-order relaxation in detail, deriving necessary and sufficient conditions for exactness in a unified manner. In the special case of least-squares regression, we show that these conditions are satisfied with high probability for random ensembles satisfying suitable incoherence conditions, similar to results on \(\ell _1\)-relaxations. In contrast to known methods, our relaxations yield lower bounds on the objective, and it can be verified whether or not the relaxation is exact. If it is not, we show that randomization based on the relaxed solution offers a principled way to generate provably good feasible solutions. This property enables us to obtain high quality estimates even if incoherence conditions are not met, as might be expected in real datasets. We numerically illustrate the performance of the relaxation-randomization strategy in both synthetic and real high-dimensional datasets, revealing substantial improvements relative to \(\ell _1\)-based methods and greedy selection heuristics.
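As a toy illustration of the randomization step mentioned above (a minimal sketch, not the paper's implementation; `u_hat` is a hypothetical relaxed solution), rounding draws independent Bernoulli trials with the fractional entries as success probabilities, so the expected support size matches the cardinality budget:

```python
import random

def randomized_round(u_hat, seed=0):
    """Draw independent Bernoulli(u_hat[j]) trials; returns a 0/1 support
    indicator whose expected size equals sum(u_hat)."""
    rng = random.Random(seed)
    return [1 if rng.random() < p else 0 for p in u_hat]

# Hypothetical relaxed solution with sum(u_hat) = 3 = k.
u_hat = [1.0, 1.0, 0.5, 0.3, 0.2, 0.0]
samples = [randomized_round(u_hat, seed=s) for s in range(2000)]
avg_support = sum(sum(s) for s in samples) / len(samples)
```

Averaged over many draws, the support size concentrates around the budget \(k\), which is the behavior the relaxation-randomization guarantees formalize.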
Notes
Taken from the Princeton University Gene Expression Project; for original source and further details please see the references therein.
Taken from FDA-NCI Clinical Proteomics Program Databank; for original source and further details please see the references therein.
References
Ahlswede, R., Winter, A.: Strong converse for identification via quantum channels. IEEE Trans. Inf. Theory 48(3), 569–579 (2002)
Akaike, H.: Information theory and an extension of the maximum likelihood principle. In: Proceedings of the 2nd International Symposium on Information Theory, Tsahkadsor, Armenia, USSR (September 1971)
Bickel, P.J., Ritov, Y., Tsybakov, A.: Simultaneous analysis of Lasso and Dantzig selector. Ann. Stat. 37(4), 1705–1732 (2009)
Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge, UK (2004)
Bühlmann, P., van de Geer, S.: Statistics for High-Dimensional Data. Springer Series in Statistics. Springer, Berlin (2011)
Candes, E.J., Tao, T.: Decoding by linear programming. IEEE Trans. Inf. Theory 51(12), 4203–4215 (2005)
Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines (and Other Kernel Based Learning Methods). Cambridge University Press, Cambridge (2000)
CVX Research, Inc.: CVX: Matlab software for disciplined convex programming, version 2.0 (August 2012)
d’Aspremont, A., El Ghaoui, L.: Testing the nullspace property using semidefinite programming. Technical report, Princeton (2009)
Davidson, K.R., Szarek, S.J.: Local operator theory, random matrices and Banach spaces. In: Handbook of the Geometry of Banach Spaces, vol. 1, pp. 317–336. Elsevier, Amsterdam (2001)
Dekel, O., Singer, Y.: Support vector machines on a budget. Adv. Neural Inf. Process. Syst. 19, 345 (2007)
Donoho, D.L.: Compressed sensing. IEEE Trans. Inf. Theory 52(4), 1289–1306 (2006)
Donoho, D.L., Elad, M., Temlyakov, V.M.: Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Trans. Inf. Theory 52(1), 6–18 (2006)
Fuchs, J.J.: Recovery of exact sparse representations in the presence of noise. In: Proceedings of ICASSP, vol. 2, pp. 533–536 (2004)
Grant, M., Boyd, S.: Graph implementations for nonsmooth convex programs. In: Blondel, V., Boyd, S., Kimura, H. (eds.) Recent Advances in Learning and Control, Lecture Notes in Control and Information Sciences, pp. 95–110. Springer, Berlin (2008)
Lasserre, J.B.: An explicit exact SDP relaxation for nonlinear 0–1 programs. In: Aardal, K., Gerards, A.M.H. (eds.) Lecture Notes in Computer Science, vol. 2081, pp. 293–303. Springer, Berlin (2001)
Laurent, M.: A comparison of the Sherali-Adams, Lovász-Schrijver and Lasserre relaxations for 0–1 programming. Math. Oper. Res. 28, 470–496 (2003)
Ledoux, M.: The Concentration of Measure Phenomenon. Mathematical Surveys and Monographs. American Mathematical Society, Providence, RI (2001)
Lovász, L., Schrijver, A.: Cones of matrices and set-functions and 0–1 optimization. SIAM J. Optim. 1, 166–190 (1991)
Markowitz, H.M.: Portfolio Selection. Wiley, New York (1959)
McCullagh, P., Nelder, J.A.: Generalized Linear Models. Monographs on Statistics and Applied Probability 37. Chapman and Hall/CRC, New York (1989)
Meinshausen, N., Bühlmann, P.: High-dimensional graphs and variable selection with the Lasso. Ann. Stat. 34, 1436–1462 (2006)
Motwani, R., Raghavan, P.: Randomized Algorithms. Cambridge University Press, Cambridge, UK (1995)
Negahban, S., Ravikumar, P., Wainwright, M.J., Yu, B.: Restricted strong convexity and generalized linear models. Technical report, UC Berkeley, Department of Statistics (August 2011)
Nesterov, Y.: Primal-dual subgradient methods for convex problems. Technical report, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL) (2005)
Oliveira, R.I.: Sums of random Hermitian matrices and an inequality by Rudelson. Elec. Comm. Prob. 15, 203–212 (2010)
Pilanci, M., El Ghaoui, L., Chandrasekaran, V.: Recovery of sparse probability measures via convex programming. In: Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 25, pp. 2420–2428. Curran Associates, Inc. (2012)
Schmidt, M., van den Berg, E., Friedlander, M., Murphy, K.: Optimizing costly functions with simple constraints: a limited-memory projected quasi-Newton algorithm. In: Proceedings of AISTATS (2009)
Sherali, H.D., Adams, W.P.: A hierarchy of relaxations between the continuous and convex hull representations for zero-one programming problems. SIAM J. Discrete Math. 3, 411–430 (1990)
Tibshirani, R.: Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. Ser. B 58(1), 267–288 (1996)
Tropp, J.A.: Just relax: Convex programming methods for subset selection and sparse approximation. ICES Report 04–04, UT-Austin, February (2004)
Wainwright, M.J.: Information-theoretic bounds on sparsity recovery in the high-dimensional and noisy setting. IEEE Trans. Inf. Theory 55, 5728–5741 (2009)
Wainwright, M.J.: Sharp thresholds for high-dimensional and noisy sparsity recovery using \(\ell _1\)-constrained quadratic programming (Lasso). IEEE Trans. Inf. Theory 55, 2183–2202 (2009)
Wainwright, M.J.: Structured regularizers: statistical and computational issues. Annu. Rev. Stat. Appl. 1, 233–253 (2014)
Wainwright, M.J., Jordan, M.I.: Treewidth-based conditions for exactness of the Sherali-Adams and Lasserre relaxations. Technical report, UC Berkeley, Department of Statistics, No. 671 (September 2004)
Wasserman, L.: Bayesian model selection and model averaging. J. Math. Psychol. 44(1), 92–107 (2000)
Zhang, Y., Wainwright, M.J., Jordan, M.I.: Lower bounds on the performance of polynomial-time algorithms for sparse linear regression. In: Proceedings of the COLT Conference, Barcelona, Spain (June 2014). Full-length version at http://arxiv.org/abs/1402.1918
Zhao, P., Yu, B.: On model selection consistency of Lasso. J. Mach. Learn. Res. 7, 2541–2567 (2006)
Zou, H., Hastie, T.J.: Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B 67(2), 301–320 (2005)
Acknowledgments
Authors MP and MJW were partially supported by Office of Naval Research MURI grant N00014-11-1-0688, and National Science Foundation Grants CIF-31712-23800 and DMS-1107000. In addition, MP was supported by a Microsoft Research Fellowship.
Appendix: Proofs
In this appendix, we provide the proofs of Theorems 2 and 3.
1.1 Proof of Theorem 2
Recalling the definition (20) of the matrix \(M\), for each \(j \in \{1, \ldots, d\}\), define the rescaled random variable \(U_j := \frac{X_j^T M y}{\rho n}\). In terms of this notation, it suffices to find a scalar \(\lambda\) such that
By definition, we have \(y = X_Sw^*_S+ \varepsilon \), whence
Based on this decomposition, we then make the following claims:
Lemma 1
There are numerical constants \(c_1, c_2\) such that
Lemma 2
There are numerical constants \(c_1, c_2\) such that
Using these two lemmas, we can now complete the proof. Recall that Theorem 2 assumes a lower bound of the form \(n > c_0 \frac{\gamma^2 + \Vert w^*_S\Vert_2^2}{w_{\mathrm{min}}^2} \, \log d\), where \(c_0\) is a sufficiently large constant. Thus, setting \(t = \frac{w_{\mathrm{min}}}{16}\) in Lemma 1 ensures that \(\max_{j = 1, \ldots, d} |B_j| \le \frac{w_{\mathrm{min}}}{16}\) with high probability. Combined with the bound (34a) from Lemma 2, we are guaranteed that
Similarly, the bound (34b) guarantees that
Thus, setting \(\lambda = \frac{5 w_{\mathrm{min}}}{32}\) ensures that the condition (32) holds.
The only remaining detail is to prove the two lemmas.
Proof of Lemma 1
Define the event \(\mathcal {E}_j = \{ \Vert X_j\Vert _2/\sqrt{n} \le 2 \}\), and observe that
Since the variable \(\Vert X_j\Vert_2^2\) follows a \(\chi^2\)-distribution with \(n\) degrees of freedom, we have \(\mathbb{P}\big[\mathcal{E}_j^c\big] \le 2 e^{-c_2 n}\). Recalling the definition (20) of the matrix \(M\), note that \(\sigma_{\mathrm{max}}\big(M\big) \le \rho^{-1}\), whence conditioned on \(\mathcal{E}_j\), we have \(\Vert M X_j\Vert_2 \le \Vert X_j\Vert_2 \le 2\sqrt{n}\). Consequently, conditioned on \(\mathcal{E}_j\), the variable \(\frac{X_j^T M\varepsilon}{\rho}\) is a Gaussian random variable with variance at most \(4 \gamma^2/\rho^2\), and hence \(\mathbb{P}[|B_j| > t \mid \mathcal{E}_j] \le 2 e^{-\frac{\rho^2 t^2}{32 \gamma^2}}\).
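As a numerical sanity check of this \(\chi^2\) concentration (a sketch with hypothetical parameters, outside the proof), a Monte Carlo estimate confirms that \(\Vert X_j\Vert_2/\sqrt{n}\) essentially never exceeds 2 for a standard Gaussian column:

```python
import math
import random

rng = random.Random(0)
n, trials = 200, 2000

# ||X_j||_2^2 is chi-square with n degrees of freedom, so the event
# ||X_j||_2 / sqrt(n) > 2 (i.e. chi^2_n > 4n) has probability e^{-c n}.
exceed = 0
for _ in range(trials):
    sq_norm = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n))
    if math.sqrt(sq_norm / n) > 2.0:
        exceed += 1
```

With \(n = 200\), the exceedance count is essentially always zero, consistent with the exponentially small tail.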
Finally, by a union bound, we have
as claimed. \(\square \)
Proof of Lemma 2
We split the proof into two parts.
(1) Proof of the bound (34a):
Note that
We now write \(X_S = U D V^T\) for the compact singular value decomposition of \(X_S\). We thus have
We will prove that for a fixed vector \(z\), the following holds with high probability
Applying the above bound to \(w_S^*\), which is a fixed vector, we obtain
Then, by the triangle inequality, the above statement implies that
and setting \(\epsilon =3/4\) yields the claim.
Next, we let \(\frac{1}{\rho} X_S^T M X_S - I = V \tilde{D} V^T\), where we defined \(\tilde{D} := \left( (\rho I + D^2)^{-1} D^2 - I \right)\). By standard results on the operator norms of Gaussian random matrices (e.g., see Davidson and Szarek [10]), the minimum singular value
of the matrix \(X_S/\sqrt{n}\) can be bounded as
where \(c_1\) is a numerical constant (independent of \((n, k)\)).
Now define \(Y_i := e_i^T V \tilde{D} V^T z = z_i v_i^T \tilde{D} v_i + v_i^T \tilde{D} \sum_{l \ne i} z_l v_l\). Then note that,
where we defined \(F(v_1) := v_1^T \tilde{D} \sum_{l \ne 1} z_l v_l\), and \(v_1\) is uniformly distributed over the sphere \(S^{k-1}\), whence \(\mathbb{E} F(v_1) = 0\). Observe that \(F\) is a Lipschitz map satisfying
Applying concentration of measure for Lipschitz functions on the sphere (e.g., see [18]) to the function \(F(v_1)\), we get that for all \(t > 0\),
Conditioning on the high-probability event \(\{\min_i |D_{ii}|^2 \ge \frac{n}{2}\}\) and then applying the tail bound (37) yields
Combining the pieces in (39) and (38), we take a union bound over \(2k\) coordinates,
where the final line follows from our choice \(\rho = \sqrt{n}\). Finally, setting \(t = \epsilon\), we obtain the statement in (35) and hence complete the proof.
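Concentration of Lipschitz functions on the sphere, the tool invoked above, is easy to probe numerically. The sketch below (hypothetical direction \(e_1\); the constant in the exponent varies across references) samples \(v\) uniformly on \(S^{k-1}\) via normalized Gaussians and compares the empirical tail of the 1-Lipschitz functional \(F(v) = v_1\) with a Levy-type bound:

```python
import math
import random

rng = random.Random(3)
k, trials, t = 50, 4000, 0.5

def uniform_sphere(dim):
    """Sample uniformly from S^{dim-1} by normalizing a Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    nrm = math.sqrt(sum(x * x for x in v))
    return [x / nrm for x in v]

# F(v) = v[0] is 1-Lipschitz on the sphere with mean zero; Levy-type
# concentration gives a tail of order 2 exp(-(k - 2) t^2 / 2).
tail = sum(1 for _ in range(trials) if abs(uniform_sphere(k)[0]) > t) / trials
bound = 2 * math.exp(-(k - 2) * t ** 2 / 2)
```

Even at moderate dimension \(k = 50\), the empirical tail already sits below the exponential bound.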
(2) Proof of the bound (34b): A similar calculation yields
for each \(j \in S^c\).
Defining the event \(\mathcal{E} = \{ \sigma_{\mathrm{max}}\big(X_S\big) \le 2 \sqrt{n} \}\), standard bounds in random matrix theory [10] imply that \(\mathbb{P}[\mathcal{E}^c] \le 2 e^{-c_2 n}\). Conditioned on \(\mathcal{E}\), we have
so that the variable \(A_j\) is conditionally Gaussian with variance at most \(\frac{4}{\rho ^2} \Vert w^*_S\Vert _2^2\). Consequently, we have
Setting \(t = \frac{w_{\mathrm{min}}}{8}\) and \(\rho = \sqrt{n}\), and taking a union bound over all \(d - k\) indices in \(S^c\), yields the claim (34b). \(\square \)
1.2 Proof of Theorem 3
The vector \(\widetilde{u}\in \{0,1\}^d\) consists of independent Bernoulli trials, and we have \(\mathbb {E}[\sum _{j=1}^d\widetilde{u}_j] \le k\). Consequently, by the Chernoff bound for Bernoulli sums, we have
as claimed.
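The Chernoff step can be checked numerically. In this sketch (hypothetical \(\widehat{u}\) with all entries \(k/d\), stdlib only), the empirical frequency of \(\sum_j \widetilde{u}_j \ge (1+\delta)k\) sits well below the multiplicative Chernoff bound \(e^{-\delta^2 k/3}\):

```python
import math
import random

rng = random.Random(1)
k, d, delta, trials = 20, 100, 0.5, 5000
probs = [k / d] * d  # hypothetical relaxed solution; E[sum of trials] = k

# Count how often the Bernoulli sum exceeds (1 + delta) * k.
exceed = sum(
    1 for _ in range(trials)
    if sum(1 for p in probs if rng.random() < p) >= (1 + delta) * k
)
empirical = exceed / trials
chernoff = math.exp(-delta ** 2 * k / 3)  # P[sum >= (1+delta)k] <= this
```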
It remains to establish the high-probability bound on the optimal value. As shown previously, the Boolean problem admits the saddle point representation
Since the optimal value is non-negative, the optimal dual parameter \(\alpha \in \mathbb {R}^n\) must have its \(\ell _2\)-norm bounded as \(\Vert \alpha \Vert _2\le 2 \Vert y\Vert _2\le 2\). Using this fact, we have
where \(\sigma _{\mathrm{max}}\big (\cdot \big )\) denotes the maximum eigenvalue of a symmetric matrix.
It remains to establish a high probability bound on this maximum eigenvalue. Recall that \(R\) is the subset of indices associated with fractional elements of \(\widehat{u}\), and moreover that \(\mathbb {E}[\widetilde{u}_j] = \widehat{u}_j\). Using these facts, we can write
where \(X_j \in \mathbb {R}^{n}\) denotes the \(j\)th column of \(\mathrm{{X}}\). Since \(\Vert X_j\Vert _2 \le 1\) by assumption and \(\widetilde{u}_j\) is Bernoulli, the matrix \(A_j\) has operator norm at most 1, and is zero mean. Consequently, by the Ahlswede-Winter matrix bound [1, 26], we have
where \(r= |R|\) is the number of fractional components. Setting \(t^2 = c \log \min \{n, r\}\) for a sufficiently large constant \(c\) yields the claim.
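To make the matrix-concentration step concrete, the following self-contained sketch (hypothetical unit-norm columns and fractional values; plausibly \(A_j = (\widetilde{u}_j - \widehat{u}_j) X_j X_j^T\), and power iteration stands in for an eigensolver) forms one draw of a zero-mean sum of the type bounded by the Ahlswede-Winter inequality and checks its operator norm against the trivial triangle-inequality bound:

```python
import math
import random

rng = random.Random(2)
n, r = 8, 50

# Hypothetical fractional entries in [0.2, 0.8] and unit-norm columns X_j.
u_hat = [rng.uniform(0.2, 0.8) for _ in range(r)]
cols = []
for _ in range(r):
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    nrm = math.sqrt(sum(x * x for x in v))
    cols.append([x / nrm for x in v])

def spectral_norm(A, iters=200):
    """Estimate the spectral norm of a symmetric matrix by power iteration."""
    v = [1.0] * len(A)
    nw = 1.0
    for _ in range(iters):
        w = [sum(A[i][j] * v[j] for j in range(len(v))) for i in range(len(A))]
        nw = math.sqrt(sum(x * x for x in w)) or 1.0
        v = [x / nw for x in w]
    return nw

# One draw of S = sum_j (b_j - u_hat[j]) X_j X_j^T, b_j ~ Bernoulli(u_hat[j]);
# each summand is zero mean with operator norm at most max(u_hat[j], 1 - u_hat[j]).
S = [[0.0] * n for _ in range(n)]
for j in range(r):
    coeff = (1.0 if rng.random() < u_hat[j] else 0.0) - u_hat[j]
    for a in range(n):
        for b in range(n):
            S[a][b] += coeff * cols[j][a] * cols[j][b]
norm_S = spectral_norm(S)
```

A typical draw lands far below the worst-case sum of the \(r\) individual norms, in line with the \(\sqrt{r \log \min\{n, r\}}\) scaling of the matrix bound.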
Pilanci, M., Wainwright, M.J. & El Ghaoui, L. Sparse learning via Boolean relaxations. Math. Program. 151, 63–87 (2015). https://doi.org/10.1007/s10107-015-0894-1