
Solving sparse principal component analysis with global support

  • Full Length Paper
  • Series A
  • Published in Mathematical Programming

Abstract

Sparse principal component analysis with global support (SPCAgs) is the problem of finding the top-r leading principal components such that all these principal components are linear combinations of a common subset of at most k variables. SPCAgs is a popular dimension-reduction tool in statistics that enhances interpretability compared to regular principal component analysis (PCA). Methods for solving SPCAgs in the literature are either greedy heuristics (in the special case of \(r = 1\)) with guarantees under restrictive statistical models, or algorithms with stationary-point convergence for some regularized reformulation of SPCAgs. Crucially, none of the existing computational methods can efficiently guarantee the quality of the solutions obtained by comparing them against dual bounds. In this work, we first propose a convex relaxation based on operator norms that provably approximates the feasible region of SPCAgs within a \(c_1 + c_2 \sqrt{\log r} = O(\sqrt{\log r})\) factor for some constants \(c_1, c_2\). To prove this result, we use a novel random sparsification procedure that relies on the Pietsch-Grothendieck factorization theorem and may be of independent interest. We also propose a simpler relaxation that is second-order cone representable and gives a \((2\sqrt{r})\)-approximation for the feasible region. Using these relaxations, we then propose a convex integer program that provides a dual bound for the optimal value of SPCAgs. Moreover, it comes with worst-case guarantees: the bound is within a multiplicative factor plus an additive term of the original optimal value, where the multiplicative factor is \(O(\log r)\) or \(O(r)\) depending on the relaxation used. Finally, we conduct computational experiments showing that our convex integer program provides, within a reasonable time, good upper bounds that are typically significantly better than the natural baselines.

Notes

  1. For some intuition: The first term in the parenthesis controls the variance of \(\widetilde{\varvec{W}}_{ii}\), which is \(\text {Var}(\widetilde{\varvec{W}}_{ii}) \le \frac{\varvec{W}^2_{ii}}{p_i} \le \frac{6}{k}\); the second term controls the largest size of a row of \(\widetilde{\varvec{V}}\), which is \(\Vert \widetilde{\varvec{V}}_{i,:}\Vert _2 \le \left\| \frac{\varvec{V}^*_{i,:}}{p_i} \right\| _2 \le \frac{6}{k} \sum _{i'} \Vert \varvec{V}^*_{i',:}\Vert _2\), which is at most \(6\sqrt{\frac{r}{k}} \le 6\) because \(\varvec{V}^* \in \mathcal {C}\mathcal {R}1\).

  2. Formally, we can append zero rows to \(\frac{\varvec{V}^*_i}{\Vert \varvec{V}^*_i\Vert _2}\) so that the resulting matrix is in \(\mathcal {C}\mathcal {R}1\).

  3. Another idea we explore is to find a principal submatrix whose determinant is near-maximal using a greedy algorithm; see Algorithm 2 in Appendix C.3. More on this in Sect. 5.1.2.

References

  1. Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X., et al.: Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403(6769), 503 (2000)

  2. Alon, U., Barkai, N., Notterman, D.A., Gish, K., Ybarra, S., Mack, D., Levine, A.J.: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. 96(12), 6745–6750 (1999)

  3. Asteris, M., Papailiopoulos, D., Kyrillidis, A., Dimakis, A.G.: Sparse PCA via bipartite matchings. In: Advances in Neural Information Processing Systems, pp. 766–774 (2015)

  4. Asteris, M., Papailiopoulos, D.S., Karystinos, G.N.: Sparse principal component of a rank-deficient matrix. In: 2011 IEEE International Symposium on Information Theory Proceedings, pp. 673–677. IEEE (2011)

  5. Attouch, H., Bolte, J., Redont, P., Soubeyran, A.: Proximal alternating minimization and projection methods for nonconvex problems: an approach based on the Kurdyka-Łojasiewicz inequality. Math. Oper. Res. 35(2), 438–457 (2010)

  6. Berthet, Q., Rigollet, P.: Computational lower bounds for sparse PCA. arXiv:1304.0828 (2013)

  7. Bolte, J., Sabach, S., Teboulle, M.: Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Program. 146(1–2), 459–494 (2014)

  8. Boutsidis, C., Drineas, P., Magdon-Ismail, M.: Sparse features for PCA-like linear regression. In: Advances in Neural Information Processing Systems, pp. 2285–2293 (2011)

  9. Burgel, P.R., Paillasseur, J., Caillaud, D., Tillie-Leblond, I., Chanez, P., Escamilla, R., Perez, T., Carré, P., Roche, N., et al.: Clinical COPD phenotypes: a novel approach using principal component and cluster analyses. Eur. Respir. J. 36(3), 531–539 (2010)

  10. Cai, T., Ma, Z., Wu, Y.: Optimal estimation and rank detection for sparse spiked covariance matrices. Probab. Theory Relat. Fields 161(3–4), 781–815 (2015)

  11. Cai, T.T., Ma, Z., Wu, Y., et al.: Sparse PCA: optimal rates and adaptive estimation. Ann. Stat. 41(6), 3074–3110 (2013)

  12. Chan, S.O., Papailiopoulos, D., Rubinstein, A.: On the approximability of sparse PCA. In: Conference on Learning Theory, pp. 623–646 (2016)

  13. Chen, S., Ma, S., Xue, L., Zou, H.: An alternating manifold proximal gradient method for sparse PCA and sparse CCA. arXiv:1903.11576 (2019)

  14. d’Aspremont, A., Bach, F., El Ghaoui, L.: Approximation bounds for sparse principal component analysis. Math. Program. 148(1–2), 89–110 (2014)

  15. d’Aspremont, A., Bach, F., Ghaoui, L.E.: Optimal solutions for sparse principal component analysis. J. Mach. Learn. Res. 9(Jul), 1269–1294 (2008)

  16. d’Aspremont, A., Ghaoui, L.E., Jordan, M.I., Lanckriet, G.R.: A direct formulation for sparse PCA using semidefinite programming. In: Advances in Neural Information Processing Systems, pp. 41–48 (2005)

  17. Del Pia, A.: Sparse PCA on fixed-rank matrices. http://www.optimization-online.org/DB_HTML/2019/07/7307.html (2019)

  18. Deshpande, Y., Montanari, A.: Sparse PCA via covariance thresholding. J. Mach. Learn. Res. 17(1), 4913–4953 (2016)

  19. Dey, S.S., Mazumder, R., Wang, G.: A convex integer programming approach for optimal sparse PCA. arXiv:1810.09062 (2018)

  20. Erichson, N.B., Zheng, P., Manohar, K., Brunton, S.L., Kutz, J.N., Aravkin, A.Y.: Sparse principal component analysis via variable projection. arXiv:1804.00341 (2018)

  21. Gallivan, K.A., Absil, P.: Note on the convex hull of the Stiefel manifold. Technical note (2010)

  22. Gu, Q., Wang, Z., Liu, H.: Sparse PCA with oracle property. In: Advances in Neural Information Processing Systems, pp. 1529–1537 (2014)

  23. Hiriart-Urruty, J.B., Lemaréchal, C.: Fundamentals of Convex Analysis. Springer Science & Business Media (2012)

  24. Johnstone, I.M., Lu, A.Y.: Sparse principal components analysis. arXiv:0901.4392 (2009)

  25. Jolliffe, I.T., Cadima, J.: Principal component analysis: a review and recent developments. Philos. Trans. R. Soc. A 374(2065), 20150202 (2016)

  26. Jolliffe, I.T., Trendafilov, N.T., Uddin, M.: A modified principal component technique based on the LASSO. J. Comput. Graph. Stat. 12(3), 531–547 (2003)

  27. Journée, M., Nesterov, Y., Richtárik, P., Sepulchre, R.: Generalized power method for sparse principal component analysis. J. Mach. Learn. Res. 11(Feb), 517–553 (2010)

  28. Kannan, R., Vempala, S.: Randomized algorithms in numerical linear algebra. Acta Numer. 26, 95 (2017)

  29. Kim, J., Tawarmalani, M., Richard, J.P.P.: Convexification of permutation-invariant sets and applications. arXiv:1910.02573 (2019)

  30. Krauthgamer, R., Nadler, B., Vilenchik, D., et al.: Do semidefinite relaxations solve sparse PCA up to the information limit? Ann. Stat. 43(3), 1300–1322 (2015)

  31. Lei, J., Vu, V.Q., et al.: Sparsistency and agnostic inference in sparse PCA. Ann. Stat. 43(1), 299–322 (2015)

  32. Ma, S.: Alternating direction method of multipliers for sparse principal component analysis. J. Oper. Res. Soc. China 1(2), 253–274 (2013)

  33. Ma, T., Wigderson, A.: Sum-of-squares lower bounds for sparse PCA. In: Advances in Neural Information Processing Systems, pp. 1612–1620 (2015)

  34. Mackey, L.W.: Deflation methods for sparse PCA. In: Advances in Neural Information Processing Systems, pp. 1017–1024 (2009)

  35. Magdon-Ismail, M.: NP-hardness and inapproximability of sparse PCA. Inf. Process. Lett. 126, 35–38 (2017)

  36. Mitzenmacher, M., Upfal, E.: Probability and Computing: Randomization and Probabilistic Techniques in Algorithms and Data Analysis. Cambridge University Press (2017)

  37. Papailiopoulos, D., Dimakis, A., Korokythakis, S.: Sparse PCA through low-rank approximations. In: International Conference on Machine Learning, pp. 747–755 (2013)

  38. Pietsch, A.: Operator Ideals, vol. 16. Deutscher Verlag der Wissenschaften (1978)

  39. Probel, C.J., Tropp, J.A.: Technical Report No. 2011-02, August 2011 (2011)

  40. Sigg, C.D., Buhmann, J.M.: Expectation-maximization for sparse and non-negative PCA. In: Proceedings of the 25th International Conference on Machine Learning, pp. 960–967. ACM (2008)

  41. Steinberg, D.: Computation of matrix norms with applications to robust optimization. Research thesis, Technion - Israel Institute of Technology (2005)

  42. Tropp, J.A.: Column subset selection, matrix factorization, and eigenvalue optimization. In: Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 978–986. SIAM (2009)

  43. Tropp, J.A.: User-friendly tail bounds for sums of random matrices. Found. Comput. Math. 12(4), 389–434 (2012)

  44. Vu, V., Lei, J.: Minimax rates of estimation for sparse PCA in high dimensions. In: Artificial Intelligence and Statistics, pp. 1278–1286 (2012)

  45. Vu, V.Q., Cho, J., Lei, J., Rohe, K.: Fantope projection and selection: a near-optimal convex relaxation of sparse PCA. In: Advances in Neural Information Processing Systems, pp. 2670–2678 (2013)

  46. Wang, G., Dey, S.: Upper bounds for model-free row-sparse principal component analysis. In: Proceedings of the International Conference on Machine Learning (2020)

  47. Wang, Z., Lu, H., Liu, H.: Tighten after relax: minimax-optimal sparse PCA in polynomial time. In: Advances in Neural Information Processing Systems, pp. 3383–3391 (2014)

  48. Wolsey, L.A., Nemhauser, G.L.: Integer and Combinatorial Optimization, vol. 55. Wiley, New York (1999)

  49. Yeung, K.Y., Ruzzo, W.L.: Principal component analysis for clustering gene expression data. Bioinformatics 17(9), 763–774 (2001)

  50. Li, Y., Xie, W.: Exact and approximation algorithms for sparse PCA. http://www.optimization-online.org/DB_HTML/2020/05/7802.html (2020)

  51. Yuan, X.T., Zhang, T.: Truncated power method for sparse eigenvalue problems. J. Mach. Learn. Res. 14(Apr), 899–925 (2013)

  52. Zhang, Y., d’Aspremont, A., El Ghaoui, L.: Sparse PCA: convex relaxations, algorithms and applications. In: Handbook on Semidefinite, Conic and Polynomial Optimization, pp. 915–940. Springer (2012)

  53. Zou, H., Hastie, T., Tibshirani, R.: Sparse principal component analysis. J. Comput. Graph. Stat. 15(2), 265–286 (2006)

Acknowledgements

We would like to thank the anonymous reviewers for excellent comments that significantly improved the paper. In particular, the SDP presented in Sect. 2.2 and the heuristic method in Appendix B.3 were suggested by the reviewers. Marco Molinaro was supported in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES, Brasil) - Finance Code 001, by Bolsa de Produtividade em Pesquisa #312751/2021-4 from CNPq, by FAPERJ grant “Jovem Cientista do Nosso Estado”, and by the CAPES-PrInt program. Santanu S. Dey would like to gratefully acknowledge the support of grant N000141912323 from ONR.

Author information

Corresponding author

Correspondence to Santanu S. Dey.

Additional information

A preliminary version of this paper was published in [46].

Appendices

Additional concentration inequalities

We need the standard multiplicative Chernoff bound (see Theorem 4.4 in [36]).

Lemma 3

(Chernoff Bound) Let \(X_1,\ldots ,X_n\) be independent random variables taking values in [0, 1]. Then for any \(\delta > 0\) we have

$$\begin{aligned} \Pr \bigg (\sum _i X_i > (1+\delta ) \mu \bigg ) \,<\, \bigg (\frac{e}{1+\delta }\bigg )^{(1+\delta ) \mu }, \end{aligned}$$

where \(\mu = {\mathbb {E}}\sum _i X_i\).

We also need the one-sided Chebychev inequality, see for example Exercise 3.18 of [36].

Lemma 4

(One-sided Chebychev) For any random variable X with finite first and second moments

$$\begin{aligned} \Pr \bigg ( X \le {\mathbb {E}}X - t \bigg ) \,\le \, \frac{\text {Var}(X)}{\text {Var}(X) + t^2}. \end{aligned}$$
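
As an illustration (not part of the paper), the following minimal NumPy check estimates both tail probabilities by simulation for sums of independent Bernoulli variables and compares them with the bounds of Lemmas 3 and 4; the parameters n, p, delta, and t below are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
n, p, delta, t = 50, 0.2, 2.0, 5.0
trials = 100_000

# Sums of n independent Bernoulli(p) variables (values in [0, 1]).
sums = rng.binomial(1, p, size=(trials, n)).sum(axis=1)
mu = n * p                       # E[sum_i X_i]
var = n * p * (1 - p)            # Var(sum_i X_i)

# Empirical tails versus the bounds of Lemma 3 and Lemma 4.
upper_emp = np.mean(sums > (1 + delta) * mu)
chernoff = (np.e / (1 + delta)) ** ((1 + delta) * mu)    # Lemma 3 (nontrivial when 1 + delta > e)
lower_emp = np.mean(sums <= mu - t)
chebychev = var / (var + t ** 2)                         # Lemma 4

print(f"P(sum > (1+delta)mu): empirical {upper_emp:.4f} <= Chernoff bound {chernoff:.4f}")
print(f"P(sum <= mu - t):     empirical {lower_emp:.4f} <= Chebychev bound {chebychev:.4f}")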

Scaling invariance of Theorem 4

Given any data matrix \(\varvec{X} \in {\mathbb {R}}^{d \times M}\) with M samples, the sample covariance matrix is \(\varvec{A} = \frac{1}{M} \varvec{X}\varvec{X}^{\top }\). Theorem 4 shows that

$$\begin{aligned} \text {opt}^{{\mathcal {F}}}(\varvec{A}) \le \text {ub}^{\mathcal {CR}i}(\varvec{A}) \le \rho _{\mathcal {CR}i}^2 \cdot \text {opt}^{{\mathcal {F}}}(\varvec{A}) + \underbrace{\sum _{j = 1}^d \frac{r \theta _j^2 \lambda _j(\varvec{A})}{4N^2}}_{\text {additive term}, ~ =: \text {add}(\varvec{A})} =: \text {aff}(\varvec{A}). \end{aligned}$$

Note that rescaling the data matrix \(\varvec{X}\) to \(\tilde{\varvec{X}} = c \cdot \varvec{X}\) for any constant \(c > 0\) does not change the approximation ratios \(\rho _{\mathcal {CR}i}^2\) for \(i \in \{1, 1', 2\}\) and changes the terms \(\text {opt} ^{{\mathcal {F}}}(\varvec{A})\) and \(\text {add} (\varvec{A})\) quadratically: letting \(\tilde{\varvec{A}} := \frac{1}{M} \tilde{\varvec{X}}\tilde{\varvec{X}}^{\top } = c^2 \varvec{A}\),

$$\begin{aligned} \text {opt}^{{\mathcal {F}}}(\tilde{\varvec{A}})&= \max _{\varvec{V}^{\top } \varvec{V} = \varvec{I}^r, \Vert \varvec{V}\Vert _0 \le k} \text {Tr}(\varvec{V}^{\top }\tilde{\varvec{A}} \varvec{V}) \\&= \max _{\varvec{V}^{\top } \varvec{V} = \varvec{I}^r, \Vert \varvec{V}\Vert _0 \le k} c^2 \cdot \text {Tr}(\varvec{V}^{\top } \varvec{A} \varvec{V}) = c^2 \cdot \text {opt}^{{\mathcal {F}}}(\varvec{A}), \\ \text {add}(\tilde{\varvec{A}})&= \sum _{j = 1}^d \frac{r \theta _j^2 \lambda _j(\tilde{\varvec{A}})}{4N^2} = \sum _{j = 1}^d \frac{r \theta _j^2 \lambda _j(c^2 \varvec{A})}{4N^2} = c^2 \cdot \sum _{j = 1}^d \frac{r \theta _j^2 \lambda _j(\varvec{A})}{4N^2} = c^2 \cdot \text {add}(\varvec{A}). \end{aligned}$$

In particular, this implies that the effective multiplicative ratio (emr) between \(\text {opt}^{{\mathcal {F}}}(\varvec{A})\) and the affine upper bound \(\text {aff} (\varvec{A})\) is invariant under rescaling:

$$\begin{aligned} \text {emr}(\varvec{A})&:= \frac{\text {aff}(\varvec{A})}{\text {opt}^{{\mathcal {F}}}(\varvec{A})} = \frac{\rho _{\mathcal {CR}i}^2 \cdot \text {opt}^{{\mathcal {F}}}(\varvec{A}) + \text {add}(\varvec{A})}{\text {opt}^{{\mathcal {F}}}(\varvec{A})} \\&= \frac{\rho _{\mathcal {CR}i}^2 \cdot c^2 \cdot \text {opt}^{{\mathcal {F}}}(\varvec{A}) + c^2 \cdot \text {add}(\varvec{A})}{ c^2 \cdot \text {opt}^{{\mathcal {F}}}(\varvec{A})} \\&= \frac{\rho _{\mathcal {CR}i}^2 \cdot \text {opt}^{{\mathcal {F}}}(\tilde{\varvec{A}}) + \text {add}(\tilde{\varvec{A}})}{ \text {opt}^{{\mathcal {F}}}(\tilde{\varvec{A}})} \\&= \frac{\text {aff}(\tilde{\varvec{A}})}{\text {opt}^{{\mathcal {F}}}(\tilde{\varvec{A}})} = \text {emr}(\tilde{\varvec{A}}). \end{aligned}$$
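
To make the scaling argument concrete, here is a small NumPy sketch (illustrative only; the dimensions, r, N, and the placeholder values for \(\theta _j\) are arbitrary assumptions) verifying that the eigenvalues, and hence the additive term \(\text {add}(\varvec{A})\), scale by \(c^2\) when the data matrix is rescaled by c, which is what drives the invariance of emr.

import numpy as np

rng = np.random.default_rng(1)
d, M, r, N = 10, 40, 3, 8
c = 2.5
theta = np.ones(d)                          # placeholder for the theta_j's in the bound

X = rng.standard_normal((d, M))
A = X @ X.T / M
A_tilde = (c * X) @ (c * X).T / M           # covariance of the rescaled data

lam = np.linalg.eigvalsh(A)
lam_tilde = np.linalg.eigvalsh(A_tilde)
assert np.allclose(lam_tilde, c**2 * lam)   # lambda_j(c^2 A) = c^2 lambda_j(A)

add = np.sum(r * theta**2 * lam / (4 * N**2))
add_tilde = np.sum(r * theta**2 * lam_tilde / (4 * N**2))
assert np.isclose(add_tilde, c**2 * add)    # add(A) scales by c^2, like opt^F(A)
print("additive term scales by c^2; the effective multiplicative ratio is unchanged")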

Greedy heuristic for SPCAgs

1.1 Complete pseudocode

(Algorithm 1: complete pseudocode of the greedy neighborhood search; figure not reproduced here. A sketch follows.)
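
Since the pseudocode figure is not reproduced here, the following is a minimal Python sketch of a swap-based greedy neighborhood search in the spirit of Algorithm 1. It uses the fact that, for a fixed row support S with \(|S| = k\), the best feasible \(\varvec{V}\) attains the sum of the top-r eigenvalues of the principal submatrix \(\varvec{A}_{S,S}\); the exact objective f, initialization, and swap rule of Algorithm 1 are those of the published pseudocode and may differ from this sketch.

import numpy as np

def support_value(A, S, r):
    """Best SPCAgs objective for a fixed support S: sum of the top-r
    eigenvalues of the principal submatrix A[S, S]."""
    sub = A[np.ix_(S, S)]
    return np.sort(np.linalg.eigvalsh(sub))[-r:].sum()

def greedy_neighborhood_search(A, k, r, max_iter=100):
    d = A.shape[0]
    # Initialize with the k indices of largest diagonal entries of A.
    S = list(np.argsort(np.diag(A))[-k:])
    best = support_value(A, S, r)
    for _ in range(max_iter):
        improved = False
        for j_out in list(S):
            for j_in in set(range(d)) - set(S):
                cand = [j for j in S if j != j_out] + [j_in]
                val = support_value(A, cand, r)
                if val > best + 1e-10:
                    S, best, improved = cand, val, True
                    break
            if improved:
                break
        if not improved:            # no improving swap: local optimum
            break
    return S, best                  # best is a primal lower bound for SPCAgs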

1.2 Proof of Lemma 2

By optimality of \(\varvec{V}^t_{S_t}\) we can see that \(f(S_t) = f(S_t, \varvec{V}^t_{S_t})\) for all t. Thus, letting \(\varvec{G}_t := \varvec{I}^k - \varvec{V}^t_{S_{t}} (\varvec{V}_{S_{t}}^t)^{\top }\) to simplify the notation, we have

$$\begin{aligned} f(S_{t-1})&= f(S_{t-1}, \varvec{V}^{t-1}_{S_{t-1}}) = \left\| \varvec{G}_t\, (\varvec{A}^{1/2})_{S_{t - 1}} \right\| _F^2 + \sum _{j \in S_{t - 1}^C} \left\| (\varvec{A}^{1/2})_{j} \right\| _2^2 \\&= \left\| \varvec{G}_t\, \varvec{A}^{1/2}_{S_{t}} \right\| _F^2 + \sum _{j \in S_t^C} \left\| \varvec{A}^{1/2}_{j} \right\| _2^2 + \underbrace{ \Delta ^{\text {in}}_{j^{\text {in}}} - \Delta ^{\text {out}}_{j^{\text {out}}} }_{> 0} \\&> \left\| \varvec{G}_t \varvec{A}^{1/2}_{S_{t}} \right\| _F^2 + \sum _{j \in S_t^C} \left\| \varvec{A}^{1/2}_{j} \right\| _2^2\\&= f(S_t, \varvec{V}^t_{S_t}) = f(S_t). \end{aligned}$$

1.3 Primal heuristic algorithm for near-maximal determinant

Here we present another primal heuristic algorithm that finds a principal submatrix whose determinant is near-maximal.

(Algorithm 2: greedy heuristic for a principal submatrix with near-maximal determinant; figure not reproduced here. A sketch follows.)
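
A minimal Python sketch of such a greedy determinant heuristic (illustrative; the published Algorithm 2 may differ in initialization and tie-breaking): indices are added one at a time, each time picking the index that maximizes the determinant of the resulting principal submatrix.

import numpy as np

def greedy_max_det_support(A, k):
    """Greedily build a size-k support whose principal submatrix of A
    has (near-)maximal determinant."""
    d = A.shape[0]
    S = []
    for _ in range(k):
        best_j, best_logdet = None, -np.inf
        for j in range(d):
            if j in S:
                continue
            cand = S + [j]
            # Use slogdet for numerical stability.
            sign, logdet = np.linalg.slogdet(A[np.ix_(cand, cand)])
            val = logdet if sign > 0 else -np.inf
            if val > best_logdet:
                best_j, best_logdet = j, val
        S.append(best_j)
    return S

# The support S can then be evaluated via the top-r eigenvalues of A[S, S]
# to obtain the lower bound lb_GH reported in Tables 7 and 8.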

We compare the performance of the greedy neighborhood search (GNS) of Algorithm 1 with the greedy heuristic (GH) of Algorithm 2 in Tables 7 and 8, where we report the relative gap defined as \(\text {Gap} := \frac{\text {lb}_{\text {GH}}}{\text {lb}_{\text {GNS}}}\), with \(\text {lb}_{\text {GH}}, \text {lb}_{\text {GNS}}\) denoting the primal lower bounds of sparse PCA obtained from the greedy heuristic and the greedy neighborhood search, respectively.

Table 7 Comparison between GH (Algorithm 2) and GNS (Algorithm 1) on artificial instances, where \(\text {Gap} := \frac{\text {lb}_{\text {GH}}}{\text {lb}_{\text {GNS}}}\)
Table 8 Comparison between GH (Algorithm 2) and GNS (Algorithm 1) on real instances, where \(\text {Gap} := \frac{\text {lb}_{\text {GH}}}{\text {lb}_{\text {GNS}}}\)

Based on the numerical results in Tables 7 and 8, the greedy neighborhood search (GNS) algorithm outperforms the greedy heuristic (GH) in every instance.

Techniques for reducing the running time of CIP

In practice, we want to reduce the running time of CIP. Below are the techniques that we used to enhance its efficiency.

1.1 Threshold

The first technique is to reduce the number of SOS-II constraints in the set PLA. Let \(\lambda _{\mathrm {TH}}\) be a threshold parameter that splits the eigenvalues \(\{\lambda _j\}_{j = 1}^d\) of the sample covariance matrix \(\varvec{A}\) into two parts \(J^+ = \{j: \lambda _j > \lambda _{\mathrm {TH}} \}\) and \(J^- = \{j: \lambda _j \le \lambda _{\mathrm {TH}} \}\). The objective function \(\text {Tr} \left( \varvec{V}^{\top } \varvec{A} \varvec{V} \right) \) satisfies

$$\begin{aligned} \text {Tr} \left( \varvec{V}^{\top } \varvec{A} \varvec{V} \right) \!= \!\sum _{j \in J^+} (\lambda _j \!-\! \lambda _{\mathrm {TH}}) \sum _{i = 1}^r g_{ji}^2 \!+\! \sum _{j \in J^-} (\lambda _j - \lambda _{\mathrm {TH}}) \sum _{i = 1}^r g_{ji}^2 + \lambda _{\mathrm {TH}} \sum _{j = 1}^d \sum _{i = 1}^r g_{ji}^2, \end{aligned}$$

in which the first term is convex, the second term is concave, and the third term satisfies

$$\begin{aligned} \lambda _{\mathrm {TH}} \sum _{j = 1}^d \sum _{i = 1}^r g_{ji}^2 \le r \lambda _{\mathrm {TH}} \end{aligned}$$
(threshold-term)

due to \(\sum _{j = 1}^d \sum _{i = 1}^r g_{ji}^2 \le r\). Since the second term is concave (so maximizing it is a convex optimization problem), we replace it by \(-s\) for a new auxiliary variable s, and we replace the third term by its upper bound \(r \lambda _{\mathrm {TH}}\), so that

$$\begin{aligned} \text {Tr} \left( \varvec{V}^{\top } \varvec{A} \varvec{V} \right) \le&~ \sum _{j \in J^+} (\lambda _j - \lambda _{\mathrm {TH}}) \sum _{i = 1}^r g_{ji}^2 - s + r \lambda _{\mathrm {TH}} \end{aligned}$$
(threshold-tech)

where

$$\begin{aligned} s \ge \sum _{j \in J^-} \underbrace{(\lambda _{\mathrm {TH}} - \lambda _j)}_{\ge 0} \sum _{i = 1}^r g_{ji}^2 \end{aligned}$$
(s-var)

is a convex constraint. We select a value of \(\lambda _{\mathrm {TH}}\) so that \(|J^+| = 3\). Therefore, it is sufficient to construct a piecewise-linear upper approximation only for the quadratic terms \(g_{ji}^2\) appearing in the first term, i.e., those with \(j \in J^+\), via the constraint set \(\text {PLA}([J^+] \times [r])\). We thus greatly reduce the number of SOS-II constraints from \({\mathcal {O}}( d \times r )\) to \({\mathcal {O}}( |J^+| \times r )\), i.e., in our experiments, to 3r SOS-II constraints.
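
The eigenvalue split itself is simple to implement. The sketch below (illustrative NumPy code, assuming the top eigenvalues are distinct) picks \(\lambda _{\mathrm {TH}}\) as the fourth-largest eigenvalue so that \(|J^+| = 3\) and reports the resulting reduction in the number of SOS-II blocks.

import numpy as np

def threshold_split(A, r, target_J_plus=3):
    # Eigenvalues in decreasing order.
    lam = np.sort(np.linalg.eigvalsh(A))[::-1]
    lam_TH = lam[target_J_plus]                # fourth largest => |J+| = 3 (if distinct)
    J_plus = np.where(lam > lam_TH)[0]
    J_minus = np.where(lam <= lam_TH)[0]
    constant_term = r * lam_TH                 # upper bound replacing lam_TH * sum_{j,i} g_{ji}^2
    n_sos2_full = lam.size * r                 # PLA over all (j, i): O(d * r) blocks
    n_sos2_reduced = J_plus.size * r           # PLA only over J+ x [r]: 3r blocks
    return J_plus, J_minus, lam_TH, constant_term, n_sos2_full, n_sos2_reduced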

1.2 Cutting planes

As in classical integer programming, we can incorporate additional cutting planes to improve efficiency.

Cutting plane for sparsity: The first family of cutting planes is obtained as follows. Since \(\Vert \varvec{V}\Vert _0 \le k\) and \(\varvec{v}_1, \ldots , \varvec{v}_r\) are orthonormal, by Bessel's inequality we have

$$\begin{aligned}&\sum _{i = 1}^r g_{ji}^2 = \sum _{i = 1}^r (\varvec{a}_j^{\top } \varvec{v}_i)^2 = \varvec{a}_j^{\top } \varvec{V} \varvec{V}^{\top } \varvec{a}_j \le \theta _j^2, \end{aligned}$$
(sparse-g)
$$\begin{aligned}&\sum _{i = 1}^r \xi _{ji} \le \theta _j^2 \left( 1 + \frac{r}{4N^2} \right) . \end{aligned}$$
(sparse-xi)

We call the above cuts sparse cuts, since \(\theta _j\) is obtained from the row-sparsity parameter k.

Cutting plane from objective value: The second type of cutting plane is based on the following property: for any symmetric matrix, the sum of its diagonal entries equals the sum of its eigenvalues. Let \(\varvec{A}_{j_1, j_1}, \ldots , \varvec{A}_{j_k, j_k}\) be the k largest diagonal entries of the sample covariance matrix \(\varvec{A}\).

Proposition 1

The following are valid cuts for SPCAgs:

$$\begin{aligned} \sum _{j = 1}^d \lambda _j \sum _{i = 1}^r g_{ji}^2 \le \varvec{A}_{j_1, j_1} + \cdots + \varvec{A}_{j_k, j_k}. \end{aligned}$$
(cut-g)

When the splitting points \(\{\gamma _{ji}^{\ell }\}_{\ell = - N}^N\) in SOS-II are set to be \(\gamma _{ji}^{\ell } = \frac{\ell }{N} \cdot \theta _j\), we have:

$$\begin{aligned} \begin{array}{rcl} {\sum }_{j \in J^+} (\lambda _j - \lambda _{\mathrm {TH}}){\sum }_{i = 1}^r \xi _{ji} - s + g \lambda _{\mathrm {TH}} &{}\le &{} \varvec{A}_{j_1, j_1} + \cdots + \varvec{A}_{j_k, j_k} + {\sum }_{j \in J^+} \frac{r (\lambda _j - \lambda _{\mathrm {TH}}) \theta _j^2}{4 N^2} \\ g &{}\ge &{}{\sum }_{j = 1}^d {\sum }_{i = 1}^r g_{ji}^2. \end{array} \end{aligned}$$
(cut-xi)
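
As a sanity check (not from the paper) of (sparse-g) and (cut-g), the snippet below generates a random feasible \(\varvec{V}\), i.e., an orthonormal \(d \times r\) matrix supported on k rows, and verifies both inequalities, using \(g_{ji} = \varvec{a}_j^{\top } \varvec{v}_i\) for the eigenpairs \((\lambda _j, \varvec{a}_j)\) of \(\varvec{A}\). Here \(\theta _j^2\) is taken to be the sum of the k largest squared entries of \(\varvec{a}_j\), the natural k-sparse bound; the paper's exact definition of \(\theta _j\) is in the main text.

import numpy as np

rng = np.random.default_rng(2)
d, r, k = 12, 3, 5

# Random sample covariance matrix and its eigen-decomposition A = sum_j lam_j a_j a_j^T.
X = rng.standard_normal((d, 30))
A = X @ X.T / 30
lam, a = np.linalg.eigh(A)                     # columns of `a` are the eigenvectors a_j

# Random feasible V: orthonormal columns, nonzero only on a row subset S with |S| = k.
S = rng.choice(d, size=k, replace=False)
Q, _ = np.linalg.qr(rng.standard_normal((k, r)))
V = np.zeros((d, r))
V[S, :] = Q

g = a.T @ V                                    # g[j, i] = a_j^T v_i

# (cut-g): sum_j lam_j sum_i g_ji^2 = Tr(V^T A V) <= sum of the k largest diagonal entries.
assert np.sum(lam[:, None] * g**2) <= np.sort(np.diag(A))[-k:].sum() + 1e-9

# (sparse-g): sum_i g_ji^2 = a_j^T V V^T a_j <= theta_j^2 (k largest squared entries of a_j).
theta_sq = np.array([np.sort(a[:, j]**2)[-k:].sum() for j in range(d)])
assert np.all((g**2).sum(axis=1) <= theta_sq + 1e-9)
print("cuts (sparse-g) and (cut-g) hold for this random feasible V")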

1.3 Implemented version of CIP

Thus the implemented version of CIP is

$$\begin{aligned} \begin{array}{llll} \max &{} {\sum }_{j \in J^+} (\lambda _j - \lambda _{\mathrm {LB}}) {\sum }_{i = 1}^r \xi _{ji} - s + r \lambda _{\mathrm {LB}} \\ \text {s.t} &{} \varvec{V} \in \mathcal {CR}2 \\ &{} (g, \xi , \eta ) \in \text {PLA}([J^+] \times [r]) \\ &{} \text {(s-var), (sparse-g), (sparse-xi), (cut-g), (cut-xi)} \end{array} \end{aligned}$$
(CIP-impl)

1.4 Submatrix technique

Proposition 2

Let \(X \in {\mathbb {R}}^{m \times n}\) and let \(\theta \) be defined as

$$\begin{aligned} \theta := \max _{\varvec{V}^1 \in {\mathbb {R}}^{m \times r}, \varvec{V}^2 \in {\mathbb {R}}^{n \times r}} ~ 2 \text {Tr} \left( (\varvec{V}^1)^{\top } \varvec{X} \varvec{V}^2 \right) \text { s.t. } (\varvec{V}^1)^{\top } \varvec{V}^1 + (\varvec{V}^2)^{\top } \varvec{V}^2 = \varvec{I}^r, \end{aligned}$$

then \(\theta \le \sqrt{r} \Vert X\Vert _F\).

Proof

$$\begin{aligned}&~ \max _{\varvec{V}^1, \varvec{V}^2} ~ 2 \text {Tr} \left( (\varvec{V}^1)^{\top } \varvec{X} \varvec{V}^2 \right) \text { s.t. } (\varvec{V}^1)^{\top } \varvec{V}^1 + (\varvec{V}^2)^{\top } \varvec{V}^2 = \varvec{I}^r, \\ \Leftrightarrow ~&~ \max _{\varvec{V}^1, \varvec{V}^2} ~ \text {Tr} \left( \begin{pmatrix} (\varvec{V}^1)^{\top }&(\varvec{V}^2)^{\top } \end{pmatrix} \begin{pmatrix} 0 &{} \varvec{X} \\ \varvec{X}^{\top } &{} 0 \\ \end{pmatrix} \begin{pmatrix} \varvec{V}^1 \\ \varvec{V}^2 \\ \end{pmatrix} \right) \text { s.t. } (\varvec{V}^1)^{\top } \varvec{V}^1 + (\varvec{V}^2)^{\top } \varvec{V}^2 = \varvec{I}^r, \\ \Leftrightarrow ~&~ \max _{\varvec{V}} ~ \text {Tr}\left( \varvec{V}^{\top } \begin{pmatrix} 0 &{} \varvec{X} \\ \varvec{X}^{\top } &{} 0 \\ \end{pmatrix} \varvec{V} \right) \text { s.t. } \varvec{V}^{\top } \varvec{V} = \varvec{I}^r. \end{aligned}$$

Note that the final maximization problem satisfies

$$\begin{aligned}&\max _{\varvec{V}} ~ \text {Tr}\left( \varvec{V}^{\top } \begin{pmatrix} 0 &{} \varvec{X} \\ \varvec{X}^{\top } &{} 0 \\ \end{pmatrix} \varvec{V} \right) \text { s.t. } \varvec{V}^{\top } \varvec{V} = \varvec{I}^r \\&\quad \le \sum _{i = 1}^r \lambda _i \left( \begin{pmatrix} 0 &{} \varvec{X} \\ \varvec{X}^{\top } &{} 0 \\ \end{pmatrix} \right) . \end{aligned}$$

Next we verify that the eigenvalues of

$$\begin{aligned} \left( \begin{array}{cc} 0 &{} X \\ X^{\top }&{} 0 \end{array}\right) \end{aligned}$$

are \(\pm \) the singular values of X: let \(X = U\Sigma W^{\top }\) be a singular value decomposition of X. In particular, note that:

$$\begin{aligned} \begin{array}{rcccl} \left( \begin{array}{cc} 0 &{} U\Sigma W^{\top } \\ W\Sigma U^{\top }&{} 0 \end{array}\right) \left[ \begin{array}{c} u_i \\ w_i \end{array}\right] &{}=&{} \left[ \begin{array}{c} U\Sigma e_i \\ W\Sigma e_i\end{array}\right] &{}=&{} \sigma _i(X)\left[ \begin{array}{c} u_i \\ w_i \end{array}\right] \\ \left( \begin{array}{cc} 0 &{} U\Sigma W^{\top } \\ W\Sigma U^{\top }&{} 0 \end{array}\right) \left[ \begin{array}{c} u_i \\ -w_i \end{array}\right] &{}=&{} \left[ \begin{array}{c} -U\Sigma e_i \\ W\Sigma e_i\end{array}\right] &{}=&{} -\sigma _i(X)\left[ \begin{array}{c} u_i \\ -w_i \end{array}\right] . \end{array} \end{aligned}$$

Therefore, we have

$$\begin{aligned} \sum _{i = 1}^r \lambda _i \left( \begin{pmatrix} 0 &{} \varvec{X} \\ \varvec{X}^{\top } &{} 0 \\ \end{pmatrix} \right) = \sum _{i = 1}^r \sigma _i(\varvec{X}) \le \sqrt{r} \Vert \varvec{X}\Vert _F, \end{aligned}$$

where the last inequality follows from the Cauchy-Schwarz inequality, since \(\sum _{i} \sigma _i(\varvec{X})^2 = \Vert \varvec{X}\Vert _F^2\).

\(\square \)
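
The two facts used above, that the eigenvalues of the symmetric block matrix are \(\pm \) the singular values of \(\varvec{X}\) (padded with zeros when \(m \ne n\)) and that the sum of the top r singular values is at most \(\sqrt{r} \Vert \varvec{X}\Vert _F\), can be confirmed numerically; a minimal NumPy check (not from the paper):

import numpy as np

rng = np.random.default_rng(3)
m, n, r = 5, 7, 3
X = rng.standard_normal((m, n))

B = np.block([[np.zeros((m, m)), X], [X.T, np.zeros((n, n))]])
eig = np.sort(np.linalg.eigvalsh(B))
sigma = np.linalg.svd(X, compute_uv=False)           # singular values, descending

# Eigenvalues of B are +/- the singular values of X (with |m - n| extra zeros).
pm = np.sort(np.concatenate([sigma, -sigma, np.zeros(abs(m - n))]))
assert np.allclose(eig, pm)

# Sum of the top-r eigenvalues of B equals the sum of the top-r singular values of X,
# which is at most sqrt(r) * ||X||_F by Cauchy-Schwarz.
assert np.isclose(eig[-r:].sum(), sigma[:r].sum())
assert sigma[:r].sum() <= np.sqrt(r) * np.linalg.norm(X, "fro") + 1e-12
print("eigenvalue and singular-value facts verified")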

Cite this article

Dey, S.S., Molinaro, M. & Wang, G. Solving sparse principal component analysis with global support. Math. Program. 199, 421–459 (2023). https://doi.org/10.1007/s10107-022-01857-w
