Solving sparse principal component analysis with global support

Dey, Santanu S.; Molinaro, Marco; Wang, Guanyi

doi:10.1007/s10107-022-01857-w

Solving sparse principal component analysis with global support

Full Length Paper
Series A
Published: 19 July 2022

Volume 199, pages 421–459, (2023)
Cite this article

Mathematical Programming Submit manuscript

653 Accesses
Explore all metrics

Abstract

Sparse principal component analysis with global support (SPCAgs), is the problem of finding the top-r leading principal components such that all these principal components are linear combinations of a common subset of at most k variables. SPCAgs is a popular dimension reduction tool in statistics that enhances interpretability compared to regular principal component analysis (PCA). Methods for solving SPCAgs in the literature are either greedy heuristics (in the special case of $r = 1$) with guarantees under restrictive statistical models or algorithms with stationary point convergence for some regularized reformulation of SPCAgs. Crucially, none of the existing computational methods can efficiently guarantee the quality of the solutions obtained by comparing them against dual bounds. In this work, we first propose a convex relaxation based on operator norms that provably approximates the feasible region of SPCAgs within a $c_1 + c_2 \sqrt{\log r} = O(\sqrt{\log r})$ factor for some constants $c_1, c_2$. To prove this result, we use a novel random sparsification procedure that uses the Pietsch-Grothendieck factorization theorem and may be of independent interest. We also propose a simpler relaxation that is second-order cone representable and gives a $(2\sqrt{r})$-approximation for the feasible region. Using these relaxations, we then propose a convex integer program that provides a dual bound for the optimal value of SPCAgs. Moreover, it also has worst-case guarantees: it is within a multiplicative/additive factor of the original optimal value, and the multiplicative factor is $O(\log r)$ or O(r) depending on the relaxation used. Finally, we conduct computational experiments that show that our convex integer program provides, within a reasonable time, good upper bounds that are typically significantly better than the natural baselines.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Sparse principal component analysis subject to prespecified cardinality of loadings

Article 22 July 2015

Certifiably optimal sparse principal component analysis

Article 01 January 2019

Sparse Principal Component Analysis with Missing Observations

Notes

For some intuition: The first term in the parenthesis controls the variance of $\widetilde{\varvec{W}}_{ii}$, which is $\text {Var}(\widetilde{\varvec{W}}_{ii}) \le \frac{\varvec{W}^2_{ii}}{p_i} \le \frac{6}{k}$; the second term controls the largest size of a row of $\widetilde{\varvec{V}}$, which is $\Vert \widetilde{\varvec{V}}_{i,:}\Vert _2 \le \left\| \frac{\varvec{V}^*_{i,:}}{p_i} \right\| _2 \le \frac{6}{k} \sum _{i'} \Vert \varvec{V}^*_{i',:}\Vert _2$, which is at most $6\sqrt{\frac{r}{k}} \le 6$ because $\varvec{V}^* \in \mathcal {C}\mathcal {R}1$.
Formally, we can append zero rows to $\frac{\varvec{V}^*_i}{\Vert \varvec{V}^*_i\Vert _2}$ so that the resulting matrix is in $\mathcal {C}\mathcal {R}1$.
Another idea we explore is to find a principal submatrix whose determinant is near-maximal using a greedy algorithm, see Algorithm 2 in Appendix C.3. More on this in Sect. 5.1.2.

References

Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X., et al.: Distinct types of diffuse large b-cell lymphoma identified by gene expression profiling. Nature 403(6769), 503 (2000)
Article Google Scholar
Alon, U., Barkai, N., Notterman, D.A., Gish, K., Ybarra, S., Mack, D., Levine, A.J.: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. National Academy Sci. 96(12), 6745–6750 (1999)
Article Google Scholar
Asteris, M., Papailiopoulos, D., Kyrillidis, A., Dimakis, A.G.: Sparse PCA via bipartite matchings. In: Advances in Neural Information Processing Systems, pp. 766–774 (2015)
Asteris, M., Papailiopoulos, D.S., Karystinos, G.N.: Sparse principal component of a rank-deficient matrix. In: 2011 IEEE International Symposium on Information Theory Proceedings, pp. 673–677. IEEE (2011)
Attouch, H., Bolte, J., Redont, P., Soubeyran, A.: Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the kurdyka-łojasiewicz inequality. Math. Oper. Res. 35(2), 438–457 (2010)
Article MathSciNet MATH Google Scholar
Berthet, Q., Rigollet, P.: Computational lower bounds for sparse pca. arXiv:1304.0828 (2013)
Bolte, J., Sabach, S., Teboulle, M.: Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Program. 146(1–2), 459–494 (2014)
Article MathSciNet MATH Google Scholar
Boutsidis, C., Drineas, P., Magdon-Ismail, M.: Sparse features for PCA-like linear regression. In: Advances in Neural Information Processing Systems, pp. 2285–2293 (2011)
Burgel, P.R., Paillasseur, J., Caillaud, D., Tillie-Leblond, I., Chanez, P., Escamilla, R., Perez, T., Carré, P., Roche, N., et al.: Clinical COPD phenotypes: a novel approach using principal component and cluster analyses. Eur. Respiratory J. 36(3), 531–539 (2010)
Article Google Scholar
Cai, T., Ma, Z., Wu, Y.: Optimal estimation and rank detection for sparse spiked covariance matrices. Probability Theory and Related Fields 161(3–4), 781–815 (2015)
Article MathSciNet MATH Google Scholar
Cai, T.T., Ma, Z., Wu, Y., et al.: Sparse PCA: Optimal rates and adaptive estimation. Ann. Stat. 41(6), 3074–3110 (2013)
Article MathSciNet MATH Google Scholar
Chan, S.O., Papailliopoulos, D., Rubinstein, A.: On the approximability of sparse PCA. In: Conference on Learning Theory, pp. 623–646 (2016)
Chen, S., Ma, S., Xue, L., Zou, H.: An alternating manifold proximal gradient method for sparse PCA and sparse CCA. arXiv:1903.11576 (2019)
d’Aspremont, A., Bach, F., El Ghaoui, L.: Approximation bounds for sparse principal component analysis. Math. Program. 148(1–2), 89–110 (2014)
Article MathSciNet MATH Google Scholar
d’Aspremont, A., Bach, F., Ghaoui, L.E.: Optimal solutions for sparse principal component analysis. J. Mach. Learn. Res. 9(Jul), 1269–1294 (2008)
MathSciNet MATH Google Scholar
d’Aspremont, A., Ghaoui, L.E., Jordan, M.I., Lanckriet, G.R.: A direct formulation for sparse PCA using semidefinite programming. In: Advances in neural information processing systems, pp. 41–48 (2005)
Del Pia, A.: Sparse PCA on fixed-rank matrices. http://www.optimization-online.org/DB_HTML/2019/07/7307.html (2019)
Deshpande, Y., Montanari, A.: Sparse PCA via covariance thresholding. J. Mach. Learn. Res. 17(1), 4913–4953 (2016)
MathSciNet MATH Google Scholar
Dey, S.S., Mazumder, R., Wang, G.: A convex integer programming approach for optimal sparse pca. arXiv:1810.09062 (2018)
Erichson, N.B., Zheng, P., Manohar, K., Brunton, S.L., Kutz, J.N., Aravkin, A.Y.: Sparse principal component analysis via variable projection. arXiv:1804.00341 (2018)
Gallivan, K.A., Absil, P.: Note on the convex hull of the stiefel manifold. Technical note (2010)
Gu, Q., Wang, Z., Liu, H.: Sparse PCA with oracle property. In: Advances in neural information processing systems, pp. 1529–1537 (2014)
Hiriart-Urruty, J.B., Lemaréchal, C.: Fundamentals of convex analysis. Springer Science & Business Media (2012)
Johnstone, I.M., Lu, A.Y.: Sparse principal components analysis. arXiv:0901.4392 (2009)
Jolliffe, I.T., Cadima, J.: Principal component analysis: a review and recent developments. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 374(2065), 20150202 (2016)
Article MathSciNet MATH Google Scholar
Jolliffe, I.T., Trendafilov, N.T., Uddin, M.: A modified principal component technique based on the LASSO. J. Comput. Graph. Stat. 12(3), 531–547 (2003)
Article MathSciNet Google Scholar
Journée, M., Nesterov, Y., Richtárik, P., Sepulchre, R.: Generalized power method for sparse principal component analysis. J. Mach. Learn. Res. 11(Feb), 517–553 (2010)
MathSciNet MATH Google Scholar
Kannan, R., Vempala, S.: Randomized algorithms in numerical linear algebra. Acta Numerica 26, 95 (2017)
Article MathSciNet MATH Google Scholar
Kim, J., Tawarmalani, M., Richard, J.P.P.: Convexification of permutation-invariant sets and applications. arXiv:1910.02573 (2019)
Krauthgamer, R., Nadler, B., Vilenchik, D., et al.: Do semidefinite relaxations solve sparse PCA up to the information limit? Ann. Stat. 43(3), 1300–1322 (2015)
Article MathSciNet MATH Google Scholar
Lei, J., Vu, V.Q., et al.: Sparsistency and agnostic inference in sparse PCA. Ann. Stat. 43(1), 299–322 (2015)
Article MathSciNet MATH Google Scholar
Ma, S.: Alternating direction method of multipliers for sparse principal component analysis. J. Oper. Res. Soc. China 1(2), 253–274 (2013)
Article MATH Google Scholar
Ma, T., Wigderson, A.: Sum-of-squares lower bounds for sparse pca. In: Advances in Neural Information Processing Systems, pp. 1612–1620 (2015)
Mackey, L.W.: Deflation methods for sparse PCA. In: Advances in neural information processing systems, pp. 1017–1024 (2009)
Magdon-Ismail, M.: NP-hardness and inapproximability of sparse PCA. Inf. Process. Lett. 126, 35–38 (2017)
Article MathSciNet MATH Google Scholar
Mitzenmacher, M., Upfal, E.: Probability and computing: Randomization and probabilistic techniques in algorithms and data analysis. Cambridge university press (2017)
Papailiopoulos, D., Dimakis, A., Korokythakis, S.: Sparse PCA through low-rank approximations. In: International Conference on Machine Learning, pp. 747–755 (2013)
Pietsch, A.: Operator ideals, vol. 16. Deutscher Verlag der Wissenschaften (1978)
PROBEL, C.J., TROPP, J.A.: Technical report no. 2011-02 august 2011 (2011)
Sigg, C.D., Buhmann, J.M.: Expectation-maximization for sparse and non-negative PCA. In: Proceedings of the 25th international conference on Machine learning, pp. 960–967. ACM (2008)
Steinberg, D.: Computation of matrix norms with applications to robust optimization. Research thesis, Technion-Israel University of Technology 2 (2005)
Tropp, J.A.: Column subset selection, matrix factorization, and eigenvalue optimization. In: Proceedings of the twentieth annual ACM-SIAM symposium on Discrete algorithms, pp. 978–986. SIAM (2009)
Tropp, J.A.: User-friendly tail bounds for sums of random matrices. Found. Comput. Math. 12(4), 389–434 (2012)
Article MathSciNet MATH Google Scholar
Vu, V., Lei, J.: Minimax rates of estimation for sparse PCA in high dimensions. In: Artificial intelligence and statistics, pp. 1278–1286 (2012)
Vu, V.Q., Cho, J., Lei, J., Rohe, K.: Fantope projection and selection: A near-optimal convex relaxation of sparse PCA. In: Advances in neural information processing systems, pp. 2670–2678 (2013)
Wang, G., Dey, S.: Upper bounds for model-free row-sparse principal component analysis. In: Proceedings of the International Conference on Machine Learning (2020)
Wang, Z., Lu, H., Liu, H.: Tighten after relax: Minimax-optimal sparse PCA in polynomial time. In: Advances in neural information processing systems, pp. 3383–3391 (2014)
Wolsey, L.A., Nemhauser, G.L.: Integer and combinatorial optimization, vol. 55. Wiley, New york (1999)
MATH Google Scholar
Yeung, K.Y., Ruzzo, W.L.: Principal component analysis for clustering gene expression data. Bioinformatics 17(9), 763–774 (2001)
Article Google Scholar
Yongchun Li, W.X.: Exact and approximation algorithms for sparse PCA. http://www.optimization-online.org/DB_HTML/2020/05/7802.html (2020)
Yuan, X.T., Zhang, T.: Truncated power method for sparse eigenvalue problems. J. Mach. Learn. Res. 14(Apr), 899–925 (2013)
MathSciNet MATH Google Scholar
Zhang, Y., d’Aspremont, A., El Ghaoui, L.: Sparse PCA: Convex relaxations, algorithms and applications. In: Handbook on Semidefinite, Conic and Polynomial Optimization, pp. 915–940. Springer (2012)
Zou, H., Hastie, T., Tibshirani, R.: Sparse principal component analysis. J. Comput. Graph. Stat. 15(2), 265–286 (2006)
Article MathSciNet Google Scholar

Download references

Acknowledgements

We would like to thank the anonymous reviewers for excellent comments that significantly improved the paper. In particular, the SDP presented in Sect. 2.2 and the heuristic method in Appendix B.3 have been suggested by the reviewers. Marco Molinaro was supported in part by the Coordenaćão de Aperfeićoamento de Pessoal de Nível Superior (CAPES, Brasil) - Finance Code 001, by Bolsa de Produtividade em Pesquisa $\#3$12751/2021-4 from CNPq, FAPERJ grant “Jovem Cientista do Nosso Estado”, and by the CAPES-PrInt program. Santanu S. Dey would like to gratefully acknowledge the support of the grant N000141912323 from ONR.

Author information

Authors and Affiliations

ISyE, Georgia Institute of Technology, Atlanta, Georgia
Santanu S. Dey & Guanyi Wang
Computer Science Department, Pontifical Catholic University of Rio de Janeiro, Janeiro, Brazil
Marco Molinaro

Authors

Santanu S. Dey
View author publications
You can also search for this author in PubMed Google Scholar
Marco Molinaro
View author publications
You can also search for this author in PubMed Google Scholar
Guanyi Wang
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Santanu S. Dey.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

A preliminary version of this paper was published in [46].

Appendices

Appendix

Additional concentration inequalities

We need the standard multiplicative Chernoff bound (see Theorem 4.4 [36]).

Lemma 3

(Chernoff Bound) Let $X_1,\ldots ,X_n$ be independent random variables taking values in [0, 1]. Then for any $\delta > 0$ we have

$$\begin{aligned} \Pr \bigg (\sum _i X_i > (1+\delta ) \mu \bigg ) \,<\, \bigg (\frac{e}{1+\delta }\bigg )^{(1+\delta ) \mu }, \end{aligned}$$

where $\mu = {\mathbb {E}}\sum _i X_i$.

We also need the one-sided Chebychev inequality, see for example Exercise 3.18 of [36].

Lemma 4

(One-sided Chebychev) For any random variable X with finite first and second moments

$$\begin{aligned} \Pr \bigg ( X \le {\mathbb {E}}X - t \bigg ) \,\le \, \frac{\text {Var}(X)}{\text {Var}(X) + t^2}. \end{aligned}$$

Scaling invariance of Theorem 4

Given any data matrix $\varvec{X} \in {\mathbb {R}}^{d \times M}$ with M samples, the sample covariance matrix is $\varvec{A} = \frac{1}{M} \varvec{X}\varvec{X}^{\top }$. Theorem 4 shows that

$$\begin{aligned} \text {opt}^{{\mathcal {F}}}(\varvec{A}) \le \text {ub}^{\mathcal {CR}i}(\varvec{A}) \le \rho _{\mathcal {CR}i}^2 \cdot \text {opt}^{{\mathcal {F}}}(\varvec{A}) + \underbrace{\sum _{j = 1}^d \frac{r \theta _j^2 \lambda _j(\varvec{A})}{4N^2}}_{\text {additive term}, ~ =: \text {add}(\varvec{A})} =: \text {aff}(\varvec{A}). \end{aligned}$$

Note that rescaling the data matrix $\varvec{X}$ to $\tilde{\varvec{X}} = c \cdot \varvec{X}$ for any constant $c > 0$ does not change the approximation ratios $\rho _{\mathcal {CR}i}^2$ for $i \in \{1, 1', 2\}$ and changes the terms $\text {opt} ^{{\mathcal {F}}}(\varvec{A})$ and $\text {add} (\varvec{A})$ quadratically: letting $\tilde{\varvec{A}} := \frac{1}{M} \tilde{\varvec{X}}\tilde{\varvec{X}}^{\top } = c^2 \varvec{A}$,

$$\begin{aligned} \text {opt}^{{\mathcal {F}}}(\tilde{\varvec{A}})&= \max _{\varvec{V}^{\top } \varvec{V} = \varvec{I}^r, \Vert \varvec{V}\Vert _0 \le k} \text {Tr}(\varvec{V}^{\top }\tilde{\varvec{A}} \varvec{V}) \\&= \max _{\varvec{V}^{\top } \varvec{V} = \varvec{I}^r, \Vert \varvec{V}\Vert _0 \le k} c^2 \cdot \text {Tr}(\varvec{V}^{\top } \varvec{A} \varvec{V}) = c^2 \cdot \text {opt}^{{\mathcal {F}}}(\varvec{A}), \\ \text {add}(\tilde{\varvec{A}})&= \sum _{j = 1}^d \frac{r \theta _j^2 \lambda _j(\tilde{\varvec{A}})}{4N^2} = \sum _{j = 1}^d \frac{r \theta _j^2 \lambda _j(c^2 \varvec{A})}{4N^2} = c^2 \cdot \sum _{j = 1}^d \frac{r \theta _j^2 \lambda _j(\varvec{A})}{4N^2} = c^2 \cdot \text {add}(\varvec{A}). \end{aligned}$$

In particular, this implies that the effective multiplicative ratio (emr) between $\text {opt}^{{\mathcal {F}}}(\varvec{A})$ and the affine upper bound $\text {aff} (\varvec{A})$ is invariant under rescaling:

$$\begin{aligned} \text {emr}(\varvec{A})&:= \frac{\text {aff}(\varvec{A})}{\text {opt}^{{\mathcal {F}}}(\varvec{A})} = \frac{\rho _{\mathcal {CR}i}^2 \cdot \text {opt}^{{\mathcal {F}}}(\varvec{A}) + \text {add}(\varvec{A})}{\text {opt}^{{\mathcal {F}}}(\varvec{A})} \\&= \frac{\rho _{\mathcal {CR}i}^2 \cdot c^2 \cdot \text {opt}^{{\mathcal {F}}}(\varvec{A}) + c^2 \cdot \text {add}(\varvec{A})}{ c^2 \cdot \text {opt}^{{\mathcal {F}}}(\varvec{A})} \\&= \frac{\rho _{\mathcal {CR}i}^2 \cdot \text {opt}^{{\mathcal {F}}}(\tilde{\varvec{A}}) + \text {add}(\tilde{\varvec{A}})}{ \text {opt}^{{\mathcal {F}}}(\tilde{\varvec{A}})} \\&= \frac{\text {aff}(\tilde{\varvec{A}})}{\text {opt}^{{\mathcal {F}}}(\tilde{\varvec{A}})} = \text {emr}(\tilde{\varvec{A}}). \end{aligned}$$

Greedy heuristic for SPCAgs

1.1 Complete pseudocode

1.2 Proof of Lemma 2

By optimality of $\varvec{V}^t_{S_t}$ we can see that $f(S_t) = f(S_t, \varvec{V}^t_{S_t})$ for all t. Thus, letting $\varvec{G}_t := \varvec{I}^k - \varvec{V}^t_{S_{t}} (\varvec{V}_{S_{t}}^t)^{\top }$ to simplify the notation, we have

$$\begin{aligned} f(S_{t-1})&= f(S_{t-1}, \varvec{V}^{t-1}_{S_{t-1}}) = \left\| \varvec{G}_t\, (\varvec{A}^{1/2})_{S_{t - 1}} \right\| _F^2 + \sum _{j \in S_{t - 1}^C} \left\| (\varvec{A}^{1/2})_{j} \right\| _2^2 \\&= \left\| \varvec{G}_t\, \varvec{A}^{1/2}_{S_{t}} \right\| _F^2 + \sum _{j \in S_t^C} \left\| \varvec{A}^{1/2}_{j} \right\| _2^2 + \underbrace{ \Delta ^{\text {in}}_{j^{\text {in}}} - \Delta ^{\text {out}}_{j^{\text {out}}} }_{> 0} \\&> \left\| \varvec{G}_t \varvec{A}^{1/2}_{S_{t}} \right\| _F^2 + \sum _{j \in S_t^C} \left\| \varvec{A}^{1/2}_{j} \right\| _2^2\\&= f(S_t, \varvec{V}^t_{S_t}) = f(S_t). \end{aligned}$$

1.3 Primal Heuristic Algorithm For Near-Maximal Determinant

Here we present another primal heuristic algorithm that finds a principal submatrix whose determinant is near-maximal.

We compare the performance of the greedy neighborhood search Algorithm 1, with the greedy heuristic (GH) 2 in Tables 7 and 8 where we report the relative gap defined as $\text {Gap} := \frac{\text {lb}_{\text {GH}}}{\text {lb}_{\text {GNS}}}$, where $\text {lb}_{\text {GH}}, \text {lb}_{\text {GNS}}$ denote the primal lower bounds of sparse PCA obtained from the greedy heuristic and the greedy neighborhood search respectively.

Table 7 Comparison between GH Algo 2 and GNS Algo 1 on artificial instances where $\text {Gap} := \frac{\text {lb}_{\text {GH}}}{\text {lb}_{\text {GNS}}}$

Full size table

Table 8 Comparison between GH Algo 2 and GNS Algo 1 on real instances where $\text {Gap} := \frac{\text {lb}_{\text {GH}}}{\text {lb}_{\text {GNS}}}$

Full size table

Based on the numerical results in Tables 7 and 8, the greedy neighborhood search (GNS) algorithm outperforms the greedy heuristic (GH) in every instance.

Techniques for reducing the running time of CIP

In practice, we want to reduce the running time of CIP. Here are the techniques that we used to enhance the efficiency in practice.

1.1 Threshold

The first technique is to reduce the number of SOS-II constraints in the set PLA. Let $\lambda _{\mathrm {TH}}$ be a threshold parameter that splits the eigenvalues $\{\lambda _j\}_{j = 1}^d$ of sample covariance matrix $\varvec{A}$ into two parts $J^+ = \{j: \lambda _j > \lambda _{\mathrm {TH}} \}$ and $J^- = \{j: \lambda _j \le \lambda _{\mathrm {TH}} \}$. The objective function $\text {Tr} \left( \varvec{V}^{\top } \varvec{A} \varvec{V} \right) $ satisfies

$$\begin{aligned} \text {Tr} \left( \varvec{V}^{\top } \varvec{A} \varvec{V} \right) \!= \!\sum _{j \in J^+} (\lambda _j \!-\! \lambda _{\mathrm {TH}}) \sum _{i = 1}^r g_{ji}^2 \!+\! \sum _{j \in J^-} (\lambda _j - \lambda _{\mathrm {TH}}) \sum _{i = 1}^r g_{ji}^2 + \lambda _{\mathrm {TH}} \sum _{j = 1}^d \sum _{i = 1}^r g_{ji}^2, \end{aligned}$$

in which the first term is convex, the second term is concave, and the third term satisfies

$$\begin{aligned} \lambda _{\mathrm {TH}} \sum _{j = 1}^d \sum _{i = 1}^r g_{ji}^2 \le r \lambda _{\mathrm {TH}} \end{aligned}$$

(threshold-term)

due to $\sum _{j = 1}^d \sum _{i = 1}^r g_{ji}^2 \le r$. Since maximizing a concave function is equivalent to convex optimization, we replace the second term by a new auxiliary variable s and the third term by its upper bound $r \lambda _{\mathrm {TH}}$ such that

$$\begin{aligned} \text {Tr} \left( \varvec{V}^{\top } \varvec{A} \varvec{V} \right) \le&~ \sum _{j \in J^+} (\lambda _j - \lambda _{\mathrm {TH}}) \sum _{i = 1}^r g_{ji}^2 - s + r \lambda _{\mathrm {TH}} \end{aligned}$$

(threshold-tech)

where

$$\begin{aligned} s \ge \sum _{j \in J^-} \underbrace{(\lambda _{\mathrm {TH}} - \lambda _j)}_{\ge 0} \sum _{i = 1}^r g_{ji}^2 \end{aligned}$$

(s-var)

is a convex constraint. We select a value of $\lambda _{\mathrm {TH}}$ so that $|J^+| = 3$. Therefore, it is sufficient to construct a piecewise-linear upper approximation for the quadratic terms $g_{ji}^2$ in the first term with $j \in J^+$, i.e., constraint set $\text {PLA}([J^+] \times [r])$. We thus, greatly reduce the number of SOS-II constraints from ${\mathcal {O}}( d \times r )$ to ${\mathcal {O}}( |J^+| \times r )$, i.e. in our experiments to 3r SOS-II constraints.

1.2 Cutting planes

Similar to classical integer programming, we can incorporate additional cutting planes to improve the efficiency.

Cutting plane for sparsity: The first family of cutting-planes is obtained as follows: Since $\Vert \varvec{V}\Vert _0 \le k$ and $\varvec{v}_1, \ldots , \varvec{v}_r$ are orthogonal, by Bessel inequality, we have

$$\begin{aligned}&\sum _{i = 1}^r g_{ji}^2 = \sum _{i = 1}^r (\varvec{a}_j^{\top } \varvec{v}_i)^2 = \varvec{a}_j^{\top } \varvec{V} \varvec{V}^{\top } \varvec{a}_j \le \theta _j^2, \end{aligned}$$

(sparse-g)

$$\begin{aligned}&\sum _{i = 1}^r \xi _{ji} \le \theta _j^2 \left( 1 + \frac{r}{4N^2} \right) . \end{aligned}$$

(sparse-xi)

We call these above cuts–sparse cut since $\theta _j$ is obtained from the row sparsity parameter k.

Cutting plane from objective value: The second type of cutting plane is based on the property: for any symmetric matrix, the sum of its diagonal entries are equal to the sum of its eigenvalues. Let $\varvec{A}_{j_1, j_1}, \ldots , \varvec{A}_{j_k, j_k}$ be the largest k diagonal entries of the sample covariance matrix $\varvec{A}$, we have

Proposition 1

The following are valid cuts for SPCAgs:

$$\begin{aligned} \sum _{j = 1}^d \lambda _j \sum _{i = 1}^r g_{ji}^2 \le \varvec{A}_{j_1, j_1} + \cdots + \varvec{A}_{j_k, j_k}. \end{aligned}$$

(cut-g)

When the splitting points $\{\gamma _{ji}^{\ell }\}_{\ell = - N}^N$ in SOS-II are set to be $\gamma _{ji}^{\ell } = \frac{\ell }{N} \cdot \theta _j$, we have:

$$\begin{aligned} \begin{array}{rcl} {\sum }_{j \in J^+} (\lambda _j - \lambda _{\mathrm {TH}}){\sum }_{i = 1}^r \xi _{ji} - s + g \lambda _{\mathrm {TH}} &{}\le &{} \varvec{A}_{j_1, j_1} + \cdots + \varvec{A}_{j_k, j_k} + {\sum }_{j \in J^+} \frac{r (\lambda _j - \phi ) \theta _j^2}{4 N^2} \\ g &{}\ge &{}{\sum }_{j = 1}^d {\sum }_{i = 1}^r g_{ji}^2. \end{array} \end{aligned}$$

(cut-xi)

1.3 Implemented version of CIP

Thus the implemented version of CIP is

$$\begin{aligned} \begin{array}{llll} \max &{} {\sum }_{j \in J^+} (\lambda _j - \lambda _{\mathrm {LB}}) {\sum }_{i = 1}^r \xi _{ji} - s + r \lambda _{\mathrm {LB}} \\ \text {s.t} &{} \varvec{V} \in \mathcal {CR}2 \\ &{} (g, \xi , \eta ) \in \text {PLA}([J^+] \times [r]) \\ &{} \text {(s-var), (sparse-g), (sparse-xi), (cut-g), (cut-xi)} \end{array} \end{aligned}$$

(CIP-impl)

1.4 Submatrix technique

Proposition 2

Let $X \in {\mathbb {R}}^{m \times n}$ and let $\theta $ be defined as

$$\begin{aligned} \theta := 2\text {max} _{\varvec{V}^1 \in {\mathbb {R}}^{m \times r}, \varvec{V}^2 \in {\mathbb {R}}^{n \times r}} ~ 2 \text {Tr} \left( (\varvec{V}^1)^{\top } \varvec{X} \varvec{V}^2 \right) \text { s.t. } (\varvec{V}^1)^{\top } \varvec{V}^1 + (\varvec{V}^2)^{\top } \varvec{V}^2 = \varvec{I}^r, \end{aligned}$$

then $\theta \le \sqrt{r} \Vert X\Vert _F$

Proof

$$\begin{aligned}&~ \max _{\varvec{V}^1, \varvec{V}^2} ~ 2 \text {Tr} \left( (\varvec{V}^1)^{\top } \varvec{X} \varvec{V}^2 \right) \text { s.t. } (\varvec{V}^1)^{\top } \varvec{V}^1 + (\varvec{V}^2)^{\top } \varvec{V}^2 = \varvec{I}^r, \\ \Leftrightarrow ~&~ \max _{\varvec{V}^1, \varvec{V}^2} ~ \text {Tr} \left( \begin{pmatrix} \varvec{V}^1)^{\top }&(\varvec{V}^2)^{\top } \end{pmatrix} \begin{pmatrix} 0 &{} \varvec{X} \\ \varvec{X}^{\top } &{} 0 \\ \end{pmatrix} \begin{pmatrix} \varvec{V}^1 \\ \varvec{V}^2 \\ \end{pmatrix} \right) \text { s.t. } (\varvec{V}^1)^{\top } \varvec{V}^1 + (\varvec{V}^2)^{\top } \varvec{V}^2 = \varvec{I}^r, \\ \Leftrightarrow ~&~ \max _{\varvec{V}} ~ \text {Tr}\left( \varvec{V}^{\top } \begin{pmatrix} 0 &{} \varvec{X} \\ \varvec{X}^{\top } &{} 0 \\ \end{pmatrix} \varvec{V} \right) \text { s.t. } \varvec{V}^{\top } \varvec{V} = \varvec{I}^r. \end{aligned}$$

Note that the final maximization problem is equal to

$$\begin{aligned}&\max _{\varvec{V}} ~ \text {Tr}\left( \varvec{V}^{\top } \begin{pmatrix} 0 &{} \varvec{X} \\ \varvec{X}^{\top } &{} 0 \\ \end{pmatrix} \varvec{V} \right) \text { s.t. } \varvec{V}^{\top } \varvec{V} = \varvec{I}^r \\&\quad \le \sum _{i = 1}^r \lambda _i \left( \begin{pmatrix} 0 &{} \varvec{X} \\ \varvec{X}^{\top } &{} 0 \\ \end{pmatrix} \right) , \end{aligned}$$

Next we verify that the eigenvalues of

$$\begin{aligned} \left( \begin{array}{cc} 0 &{} X \\ X^{\top }&{} 0 \end{array}\right) \end{aligned}$$

are ± singular values of X: Let $X = U\Sigma W^{\top }$. In particular, note that:

$$\begin{aligned} \begin{array}{rcccl} \left( \begin{array}{cc} 0 &{} U\Sigma W^{\top } \\ W\Sigma U^{\top }&{} 0 \end{array}\right) \left[ \begin{array}{c} u_i \\ w_i \end{array}\right] &{}=&{} \left[ \begin{array}{c} U\Sigma e_i \\ W\Sigma e_i\end{array}\right] &{}=&{} \sigma _i(X)\left[ \begin{array}{c} u_i \\ w_i \end{array}\right] \\ \left( \begin{array}{cc} 0 &{} U\Sigma W^{\top } \\ W\Sigma U^{\top }&{} 0 \end{array}\right) \left[ \begin{array}{c} u_i \\ -w_i \end{array}\right] &{}=&{} \left[ \begin{array}{c} -U\Sigma e_i \\ W\Sigma e_i\end{array}\right] &{}=&{} -\sigma _i(X)\left[ \begin{array}{c} u_i \\ -w_i \end{array}\right] . \end{array} \end{aligned}$$

Therefore, we have

$$\begin{aligned} \sum _{i = 1}^r \lambda _i \left( \begin{pmatrix} 0 &{} \varvec{X} \\ \varvec{X}^{\top } &{} 0 \\ \end{pmatrix} \right) = \sum _{i = 1}^r \sigma _i(\varvec{X}) \le \sqrt{r} \Vert \varvec{X}\Vert _F. \end{aligned}$$

$\square $

Rights and permissions

Reprints and permissions

About this article

Cite this article

Dey, S.S., Molinaro, M. & Wang, G. Solving sparse principal component analysis with global support. Math. Program. 199, 421–459 (2023). https://doi.org/10.1007/s10107-022-01857-w

Download citation

Received: 21 November 2020
Accepted: 09 May 2022
Published: 19 July 2022
Issue Date: May 2023
DOI: https://doi.org/10.1007/s10107-022-01857-w

Keywords

Mathematics Subject Classification

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Solving sparse principal component analysis with global support

Abstract

Access this article

Similar content being viewed by others

Sparse principal component analysis subject to prespecified cardinality of loadings

Certifiably optimal sparse principal component analysis

Sparse Principal Component Analysis with Missing Observations

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Appendices

Appendix

Additional concentration inequalities

Lemma 3

Lemma 4

Scaling invariance of Theorem 4

Greedy heuristic for SPCAgs

1.1 Complete pseudocode

1.2 Proof of Lemma 2

1.3 Primal Heuristic Algorithm For Near-Maximal Determinant

Techniques for reducing the running time of CIP

1.1 Threshold

1.2 Cutting planes

Proposition 1

1.3 Implemented version of CIP

1.4 Submatrix technique

Proposition 2

Proof

Rights and permissions

About this article

Cite this article

Keywords

Mathematics Subject Classification

Navigation

Solving sparse principal component analysis with global support

Abstract

Access this article

Similar content being viewed by others

Sparse principal component analysis subject to prespecified cardinality of loadings

Certifiably optimal sparse principal component analysis

Sparse Principal Component Analysis with Missing Observations

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Appendices

Appendix

Additional concentration inequalities

Lemma 3

Lemma 4

Scaling invariance of Theorem 4

Greedy heuristic for SPCAgs

1.1 Complete pseudocode

1.2 Proof of Lemma 2

1.3 Primal Heuristic Algorithm For Near-Maximal Determinant

Techniques for reducing the running time of CIP

1.1 Threshold

1.2 Cutting planes

Proposition 1

1.3 Implemented version of CIP

1.4 Submatrix technique

Proposition 2

Proof

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Mathematics Subject Classification

Search

Navigation