Abstract
We introduce novel relaxations for cardinality-constrained learning problems, including least-squares regression as a special but important case. Our approach is based on reformulating a cardinality-constrained problem exactly as a Boolean program, to which standard convex relaxations such as the Lasserre and Sherali-Adams hierarchies can be applied. We analyze the first-order relaxation in detail, deriving necessary and sufficient conditions for exactness in a unified manner. In the special case of least-squares regression, we show that these conditions are satisfied with high probability for random ensembles satisfying suitable incoherence conditions, similar to results on \(\ell _1\)-relaxations. In contrast to known methods, our relaxations yield lower bounds on the objective, and it can be verified whether or not the relaxation is exact. If it is not, we show that randomization based on the relaxed solution offers a principled way to generate provably good feasible solutions. This property enables us to obtain high quality estimates even if incoherence conditions are not met, as might be expected in real datasets. We numerically illustrate the performance of the relaxation-randomization strategy in both synthetic and real high-dimensional datasets, revealing substantial improvements relative to \(\ell _1\)-based methods and greedy selection heuristics.
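As a toy illustration of the randomization step mentioned above (a minimal sketch, not the paper's implementation; `u_hat` is a hypothetical relaxed solution), rounding draws independent Bernoulli trials with the fractional entries as success probabilities, so the expected support size matches the cardinality budget:

```python
import random

def randomized_round(u_hat, seed=0):
    """Draw independent Bernoulli(u_hat[j]) trials; returns a 0/1 support
    indicator whose expected size equals sum(u_hat)."""
    rng = random.Random(seed)
    return [1 if rng.random() < p else 0 for p in u_hat]

# Hypothetical relaxed solution with sum(u_hat) = 3 = k.
u_hat = [1.0, 1.0, 0.5, 0.3, 0.2, 0.0]
samples = [randomized_round(u_hat, seed=s) for s in range(2000)]
avg_support = sum(sum(s) for s in samples) / len(samples)
```

Averaged over many draws, the support size concentrates around the budget \(k\), which is the behavior the relaxation-randomization guarantees formalize.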
Notes
Taken from the Princeton University Gene Expression Project; for original source and further details please see the references therein.
Taken from FDA-NCI Clinical Proteomics Program Databank; for original source and further details please see the references therein.
References
Ahlswede, R., Winter, A.: Strong converse for identification via quantum channels. IEEE Trans. Inf. Theory 48(3), 569–579 (2002)
Akaike, H.: Information theory and an extension of the maximum likelihood principle. In: Proceedings of the 2nd International Symposium on Information Theory, Tsahkadsor, Armenia, USSR (September 1971)
Bickel, P.J., Ritov, Y., Tsybakov, A.: Simultaneous analysis of Lasso and Dantzig selector. Ann. Stat. 37(4), 1705–1732 (2009)
Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge, UK (2004)
Bühlmann, P., van de Geer, S.: Statistics for High-Dimensional Data. Springer Series in Statistics. Springer, Berlin (2011)
Candes, E.J., Tao, T.: Decoding by linear programming. IEEE Trans. Inf. Theory 51(12), 4203–4215 (2005)
Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines (and Other Kernel Based Learning Methods). Cambridge University Press, Cambridge (2000)
CVX Research, Inc.: CVX: Matlab software for disciplined convex programming, version 2.0 (August 2012)
d’Aspremont, A., El Ghaoui, L.: Testing the nullspace property using semidefinite programming. Technical report, Princeton (2009)
Davidson, K.R., Szarek, S.J.: Local operator theory, random matrices and Banach spaces. In: Handbook of the Geometry of Banach Spaces, vol. 1, pp. 317–336. Elsevier, Amsterdam (2001)
Dekel, O., Singer, Y.: Support vector machines on a budget. Adv. Neural Inf. Process. Syst. 19, 345 (2007)
Donoho, D.L.: Compressed sensing. IEEE Trans. Inf. Theory 52(4), 1289–1306 (2006)
Donoho, D.L., Elad, M., Temlyakov, V.M.: Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Trans. Inf. Theory 52(1), 6–18 (2006)
Fuchs, J.J.: Recovery of exact sparse representations in the presence of noise. In: Proceedings of ICASSP, vol. 2, pp. 533–536 (2004)
Grant, M., Boyd, S.: Graph implementations for nonsmooth convex programs. In: Blondel, V., Boyd, S., Kimura, H. (eds.) Recent Advances in Learning and Control, Lecture Notes in Control and Information Sciences, pp. 95–110. Springer, Berlin (2008)
Lasserre, J.B.: An explicit exact SDP relaxation for nonlinear 0–1 programs. In: Aardal, K., Gerards, A.M.H. (eds.) Lecture Notes in Computer Science, vol. 2081, pp. 293–303. Springer, Berlin (2001)
Laurent, M.: A comparison of the Sherali-Adams, Lovász-Schrijver and Lasserre relaxations for 0–1 programming. Math. Oper. Res. 28, 470–496 (2003)
Ledoux, M.: The Concentration of Measure Phenomenon. Mathematical Surveys and Monographs. American Mathematical Society, Providence, RI (2001)
Lovász, L., Schrijver, A.: Cones of matrices and set-functions and 0–1 optimization. SIAM J. Optim. 1, 166–190 (1991)
Markowitz, H.M.: Portfolio Selection. Wiley, New York (1959)
McCullagh, P., Nelder, J.A.: Generalized Linear Models. Monographs on Statistics and Applied Probability 37. Chapman and Hall/CRC, New York (1989)
Meinshausen, N., Bühlmann, P.: High-dimensional graphs and variable selection with the Lasso. Ann. Stat. 34, 1436–1462 (2006)
Motwani, R., Raghavan, P.: Randomized Algorithms. Cambridge University Press, Cambridge, UK (1995)
Negahban, S., Ravikumar, P., Wainwright, M.J., Yu, B.: Restricted strong convexity and generalized linear models. Technical report, UC Berkeley, Department of Statistics (August 2011)
Nesterov, Y.: Primal-dual subgradient methods for convex problems. Technical report, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL) (2005)
Oliveira, R.I.: Sums of random Hermitian matrices and an inequality by Rudelson. Elec. Comm. Prob. 15, 203–212 (2010)
Pilanci, M., El Ghaoui, L., Chandrasekaran, V.: Recovery of sparse probability measures via convex programming. In: Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 25, pp. 2420–2428. Curran Associates, Inc. (2012)
Schmidt, M., van den Berg, E., Friedlander, M., Murphy, K.: Optimizing costly functions with simple constraints: a limited-memory projected quasi-Newton algorithm. In: Proceedings of AISTATS (2009)
Sherali, H.D., Adams, W.P.: A hierarchy of relaxations between the continuous and convex hull representations for zero-one programming problems. SIAM J. Discrete Math. 3, 411–430 (1990)
Tibshirani, R.: Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. Ser. B 58(1), 267–288 (1996)
Tropp, J.A.: Just relax: Convex programming methods for subset selection and sparse approximation. ICES Report 04–04, UT-Austin, February (2004)
Wainwright, M.J.: Information-theoretic bounds on sparsity recovery in the high-dimensional and noisy setting. IEEE Trans. Inf. Theory 55, 5728–5741 (2009)
Wainwright, M.J.: Sharp thresholds for high-dimensional and noisy sparsity recovery using \(\ell _1\)-constrained quadratic programming (Lasso). IEEE Trans. Inf. Theory 55, 2183–2202 (2009)
Wainwright, M.J.: Structured regularizers: statistical and computational issues. Annu. Rev. Stat. Appl. 1, 233–253 (2014)
Wainwright, M.J., Jordan, M.I.: Treewidth-based conditions for exactness of the Sherali-Adams and Lasserre relaxations. Technical report, UC Berkeley, Department of Statistics, No. 671 (September 2004)
Wasserman, L.: Bayesian model selection and model averaging. J. Math. Psychol. 44(1), 92–107 (2000)
Zhang, Y., Wainwright, M.J., Jordan, M.I.: Lower bounds on the performance of polynomial-time algorithms for sparse linear regression. In: Proceedings of the COLT Conference, Barcelona, Spain (June 2014). Full-length version at http://arxiv.org/abs/1402.1918
Zhao, P., Yu, B.: On model selection consistency of Lasso. J. Mach. Learn. Res. 7, 2541–2567 (2006)
Zou, H., Hastie, T.J.: Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B 67(2), 301–320 (2005)
Acknowledgments
Authors MP and MJW were partially supported by Office of Naval Research MURI grant N00014-11-1-0688, and National Science Foundation Grants CIF-31712-23800 and DMS-1107000. In addition, MP was supported by a Microsoft Research Fellowship.
Appendix: Proofs
In this appendix, we provide the proofs of Theorems 2 and 3.
1.1 Proof of Theorem 2
Recalling the definition (20) of the matrix \(M\), for each \(j \in \{1, \ldots, d\}\), define the rescaled random variable \(U_j := \frac{X_j^T M y}{\rho n}\). In terms of this notation, it suffices to find a scalar \(\lambda\) such that
By definition, we have \(y = X_Sw^*_S+ \varepsilon \), whence
Based on this decomposition, we then make the following claims:
Lemma 1
There are numerical constants \(c_1, c_2\) such that
Lemma 2
There are numerical constants \(c_1, c_2\) such that
Using these two lemmas, we can now complete the proof. Recall that Theorem 2 assumes a lower bound of the form \(n > c_0 \frac{\gamma^2 + \Vert w^*_S\Vert_2^2}{w_{\mathrm{min}}^2} \, \log d\), where \(c_0\) is a sufficiently large constant. Thus, setting \(t = \frac{w_{\mathrm{min}}}{16}\) in Lemma 1 ensures that \(\max_{j = 1, \ldots, d} |B_j| \le \frac{w_{\mathrm{min}}}{16}\) with high probability. Combined with the bound (34a) from Lemma 2, we are guaranteed that
Similarly, the bound (34b) guarantees that
Thus, setting \(\lambda = \frac{5 w_{\mathrm{min}}}{32}\) ensures that the condition (32) holds.
The only remaining detail is to prove the two lemmas.
Proof of Lemma 1
Define the event \(\mathcal {E}_j = \{ \Vert X_j\Vert _2/\sqrt{n} \le 2 \}\), and observe that
Since the variable \(\Vert X_j\Vert_2^2\) follows a \(\chi^2\)-distribution with \(n\) degrees of freedom, we have \(\mathbb{P}\big[\mathcal{E}_j^c\big] \le 2 e^{-c_2 n}\). Recalling the definition (20) of the matrix \(M\), note that \(\sigma_{\mathrm{max}}\big(M\big) \le \rho^{-1}\), whence conditioned on \(\mathcal{E}_j\), we have \(\Vert M X_j\Vert_2 \le \Vert X_j\Vert_2 \le 2\sqrt{n}\). Consequently, conditioned on \(\mathcal{E}_j\), the variable \(\frac{X_j^T M\varepsilon}{\rho}\) is a Gaussian random variable with variance at most \(4 \gamma^2/\rho^2\), and hence \(\mathbb{P}[|B_j| > t \mid \mathcal{E}_j] \le 2 e^{-\frac{\rho^2 t^2}{32 \gamma^2}}\).
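As a numerical sanity check of this \(\chi^2\) concentration (a sketch with hypothetical parameters, outside the proof), a Monte Carlo estimate confirms that \(\Vert X_j\Vert_2/\sqrt{n}\) essentially never exceeds 2 for a standard Gaussian column:

```python
import math
import random

rng = random.Random(0)
n, trials = 200, 2000

# ||X_j||_2^2 is chi-square with n degrees of freedom, so the event
# ||X_j||_2 / sqrt(n) > 2 (i.e. chi^2_n > 4n) has probability e^{-c n}.
exceed = 0
for _ in range(trials):
    sq_norm = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n))
    if math.sqrt(sq_norm / n) > 2.0:
        exceed += 1
```

With \(n = 200\), the exceedance count is essentially always zero, consistent with the exponentially small tail.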
Finally, by a union bound, we have
as claimed. \(\square \)
Proof of Lemma 2
We split the proof into two parts.
(1) Proof of the bound (34a):
Note that
We now write \(X_S = U D V^T\) for the compact singular value decomposition of \(X_S\). We thus have
We will prove that for a fixed vector \(z\), the following holds with high probability
Applying the above bound to \(w_S^*\), which is a fixed vector, we obtain
Then, by the triangle inequality, the above statement implies that
and setting \(\epsilon =3/4\) yields the claim.
Next, we let \(\frac{1}{\rho} X_S^T M X_S - I = V \tilde{D} V^T\), where we defined \(\tilde{D} := \left( (\rho I + D^2)^{-1} D^2 - I \right)\). By standard results on the operator norms of Gaussian random matrices (e.g., see Davidson and Szarek [10]), the minimum singular value
of the matrix \(X_S/\sqrt{n}\) can be bounded as
where \(c_1\) is a numerical constant (independent of \((n, k)\)).
Now define \(Y_i := e_i^T V \tilde{D} V^T z = z_i v_i^T \tilde{D} v_i + v_i^T \tilde{D} \sum_{l \ne i} z_l v_l\). Then note that,
where we defined \(F(v_1) := v_1^T \tilde{D} \sum_{l \ne 1} z_l v_l\), and \(v_1\) is uniformly distributed over the sphere \(S^{k-1}\), whence \(\mathbb{E} F(v_1) = 0\). Observe that \(F\) is a Lipschitz map satisfying
Applying concentration of measure for Lipschitz functions on the sphere (e.g., see [18]) to the function \(F(v_1)\), we get that for all \(t > 0\),
Conditioning on the high-probability event \(\{\min_i |D_{ii}|^2 \ge \frac{n}{2}\}\) and then applying the tail bound (37) yields
Combining the pieces in (39) and (38), we take a union bound over \(2k\) coordinates,
where the final line follows from our choice \(\rho = \sqrt{n}\). Finally, setting \(t = \epsilon\), we obtain the statement in (35) and hence complete the proof.
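Concentration of Lipschitz functions on the sphere, the tool invoked above, is easy to probe numerically. The sketch below (hypothetical direction \(e_1\); the constant in the exponent varies across references) samples \(v\) uniformly on \(S^{k-1}\) via normalized Gaussians and compares the empirical tail of the 1-Lipschitz functional \(F(v) = v_1\) with a Levy-type bound:

```python
import math
import random

rng = random.Random(3)
k, trials, t = 50, 4000, 0.5

def uniform_sphere(dim):
    """Sample uniformly from S^{dim-1} by normalizing a Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    nrm = math.sqrt(sum(x * x for x in v))
    return [x / nrm for x in v]

# F(v) = v[0] is 1-Lipschitz on the sphere with mean zero; Levy-type
# concentration gives a tail of order 2 exp(-(k - 2) t^2 / 2).
tail = sum(1 for _ in range(trials) if abs(uniform_sphere(k)[0]) > t) / trials
bound = 2 * math.exp(-(k - 2) * t ** 2 / 2)
```

Even at moderate dimension \(k = 50\), the empirical tail already sits below the exponential bound.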
(2) Proof of the bound (34b): A similar calculation yields
for each \(j \in S^c\).
Defining the event \(\mathcal{E} = \{ \sigma_{\mathrm{max}}\big(X_S\big) \le 2 \sqrt{n} \}\), standard bounds in random matrix theory [10] imply that \(\mathbb{P}[\mathcal{E}^c] \le 2 e^{-c_2 n}\). Conditioned on \(\mathcal{E}\), we have
so that the variable \(A_j\) is conditionally Gaussian with variance at most \(\frac{4}{\rho ^2} \Vert w^*_S\Vert _2^2\). Consequently, we have
Setting \(t = \frac{w_{\mathrm{min}}}{8}\) and \(\rho = \sqrt{n}\), and taking a union bound over all \(d - k\) indices in \(S^c\), yields the claim (34b). \(\square \)
1.2 Proof of Theorem 3
The vector \(\widetilde{u}\in \{0,1\}^d\) consists of independent Bernoulli trials, and we have \(\mathbb {E}[\sum _{j=1}^d\widetilde{u}_j] \le k\). Consequently, by the Chernoff bound for Bernoulli sums, we have
as claimed.
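The Chernoff step can be checked numerically. In this sketch (hypothetical \(\widehat{u}\) with all entries \(k/d\), stdlib only), the empirical frequency of \(\sum_j \widetilde{u}_j \ge (1+\delta)k\) sits well below the multiplicative Chernoff bound \(e^{-\delta^2 k/3}\):

```python
import math
import random

rng = random.Random(1)
k, d, delta, trials = 20, 100, 0.5, 5000
probs = [k / d] * d  # hypothetical relaxed solution; E[sum of trials] = k

# Count how often the Bernoulli sum exceeds (1 + delta) * k.
exceed = sum(
    1 for _ in range(trials)
    if sum(1 for p in probs if rng.random() < p) >= (1 + delta) * k
)
empirical = exceed / trials
chernoff = math.exp(-delta ** 2 * k / 3)  # P[sum >= (1+delta)k] <= this
```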
It remains to establish the high-probability bound on the optimal value. As shown previously, the Boolean problem admits the saddle point representation
Since the optimal value is non-negative, the optimal dual parameter \(\alpha \in \mathbb {R}^n\) must have its \(\ell _2\)-norm bounded as \(\Vert \alpha \Vert _2\le 2 \Vert y\Vert _2\le 2\). Using this fact, we have
where \(\sigma _{\mathrm{max}}\big (\cdot \big )\) denotes the maximum eigenvalue of a symmetric matrix.
It remains to establish a high probability bound on this maximum eigenvalue. Recall that \(R\) is the subset of indices associated with fractional elements of \(\widehat{u}\), and moreover that \(\mathbb {E}[\widetilde{u}_j] = \widehat{u}_j\). Using these facts, we can write
where \(X_j \in \mathbb {R}^{n}\) denotes the \(j\)th column of \(\mathrm{{X}}\). Since \(\Vert X_j\Vert _2 \le 1\) by assumption and \(\widetilde{u}_j\) is Bernoulli, the matrix \(A_j\) has operator norm at most 1, and is zero mean. Consequently, by the Ahlswede-Winter matrix bound [1, 26], we have
where \(r= |R|\) is the number of fractional components. Setting \(t^2 = c \log \min \{n, r\}\) for a sufficiently large constant \(c\) yields the claim.
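To make the matrix-concentration step concrete, the following self-contained sketch (hypothetical unit-norm columns and fractional values; plausibly \(A_j = (\widetilde{u}_j - \widehat{u}_j) X_j X_j^T\), and power iteration stands in for an eigensolver) forms one draw of a zero-mean sum of the type bounded by the Ahlswede-Winter inequality and checks its operator norm against the trivial triangle-inequality bound:

```python
import math
import random

rng = random.Random(2)
n, r = 8, 50

# Hypothetical fractional entries in [0.2, 0.8] and unit-norm columns X_j.
u_hat = [rng.uniform(0.2, 0.8) for _ in range(r)]
cols = []
for _ in range(r):
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    nrm = math.sqrt(sum(x * x for x in v))
    cols.append([x / nrm for x in v])

def spectral_norm(A, iters=200):
    """Estimate the spectral norm of a symmetric matrix by power iteration."""
    v = [1.0] * len(A)
    nw = 1.0
    for _ in range(iters):
        w = [sum(A[i][j] * v[j] for j in range(len(v))) for i in range(len(A))]
        nw = math.sqrt(sum(x * x for x in w)) or 1.0
        v = [x / nw for x in w]
    return nw

# One draw of S = sum_j (b_j - u_hat[j]) X_j X_j^T, b_j ~ Bernoulli(u_hat[j]);
# each summand is zero mean with operator norm at most max(u_hat[j], 1 - u_hat[j]).
S = [[0.0] * n for _ in range(n)]
for j in range(r):
    coeff = (1.0 if rng.random() < u_hat[j] else 0.0) - u_hat[j]
    for a in range(n):
        for b in range(n):
            S[a][b] += coeff * cols[j][a] * cols[j][b]
norm_S = spectral_norm(S)
```

A typical draw lands far below the worst-case sum of the \(r\) individual norms, in line with the \(\sqrt{r \log \min\{n, r\}}\) scaling of the matrix bound.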
Pilanci, M., Wainwright, M.J. & El Ghaoui, L. Sparse learning via Boolean relaxations. Math. Program. 151, 63–87 (2015). https://doi.org/10.1007/s10107-015-0894-1