Certifiably optimal sparse principal component analysis


This paper addresses the sparse principal component analysis (SPCA) problem for covariance matrices in dimension n aiming to find solutions with sparsity k using mixed integer optimization. We propose a tailored branch-and-bound algorithm, Optimal-SPCA, that enables us to solve SPCA to certifiable optimality in seconds for \(n = 100\) s, \(k=10\) s. This same algorithm can be applied to problems with \(n=10{,}000\,\mathrm{s}\) or higher to find high-quality feasible solutions in seconds while taking several hours to prove optimality. We apply our methods to a number of real data sets to demonstrate that our approach scales to the same problem sizes attempted by other methods, while providing superior solutions compared to those methods, explaining a higher portion of variance and permitting complete control over the desired sparsity. The software that was reviewed as part of this submission has been given the DOI (digital object identifier) https://doi.org/10.5281/zenodo.2027898.

This is a preview of subscription content, log in to check access.

Fig. 1
Fig. 2
Fig. 3


  1. 1.

    Amini, A.A., Wainwright, M.J.: High-dimensional analysis of semidefinite relaxations for sparse principal components. In: IEEE International Symposium on Information Theory, pp. 2454–2458. IEEE (2008)

  2. 2.

    Asteris, M., Papailiopoulos, D., Kyrillidis, A., Dimakis, A.G.: Sparse PCA via bipartite matchings. In: Advances in Neural Information Processing Systems, pp. 766–774 (2015)

  3. 3.

    Bair, E., Hastie, T., Paul, D., Tibshirani, R.: Prediction by supervised principal components. J. Am. Stat. Assoc. 101(473), 119–137 (2006)

    MathSciNet  MATH  Article  Google Scholar 

  4. 4.

    Beck, A., Vaisbourd, Y.: The sparse principal component analysis problem: optimality conditions and algorithms. J0 Optim. Theory Appl. 170(1), 119–143 (2016)

    MathSciNet  MATH  Article  Google Scholar 

  5. 5.

    Bennett, K.P., Parrado-Hernández, E.: The interplay of optimization and machine learning research. J. Mach. Learn. Res. 7, 1265–1281 (2006)

    MathSciNet  MATH  Google Scholar 

  6. 6.

    Bertsimas, D., Copenhaver, M.S.: Characterization of the equivalence of robustification and regularization in linear and matrix regression. Eur. J. Oper. Res. 270, 931–942 (2017)

    MathSciNet  MATH  Article  Google Scholar 

  7. 7.

    Bertsimas, D., Copenhaver, M.S., Mazumder, R.: Certifiably optimal low rank factor analysis. J. Mach. Learn. Res. 18(29), 1–53 (2017)

    MathSciNet  MATH  Google Scholar 

  8. 8.

    Bertsimas, D., Dunn, J.: Optimal classification trees. Mach. Learn. 64(1), 1–44 (2017)

    MathSciNet  MATH  Google Scholar 

  9. 9.

    Bertsimas, D., King, A.: An algorithmic approach to linear regression. Oper. Res. 64(1), 2–16 (2016)

    MathSciNet  MATH  Article  Google Scholar 

  10. 10.

    Bertsimas, D., King, A., Mazumder, R., et al.: Best subset selection via a modern optimization lens. Ann. Stat. 44(2), 813–852 (2016)

    MathSciNet  MATH  Article  Google Scholar 

  11. 11.

    Bertsimas, D., Shioda, R.: Classification and regression via integer optimization. Oper. Res. 55(2), 252–271 (2007)

    MathSciNet  MATH  Article  Google Scholar 

  12. 12.

    Bixby, R.E.: A brief history of linear and mixed-integer programming computation. Doc. Math. Extra Volume: Optimization Stories, 107–121 (2012)

  13. 13.

    Candès, E.J., Li, X., Ma, Y., Wright, J.: Robust principal component analysis? J. ACM 58(3), 11 (2011)

    MathSciNet  MATH  Article  Google Scholar 

  14. 14.

    Carrizosa, E., Guerrero, V.: rs-Sparse principal component analysis: a mixed integer nonlinear programming approach with VNS. Comput. Oper. Res. 52, 349–354 (2014)

    MathSciNet  MATH  Article  Google Scholar 

  15. 15.

    Chamberlain, G., Rothschild, M.J.: Arbitrage, factor structure, and mean-variance analysis on large asset markets. Econometrica 51, 1281–1304 (1983)

    MathSciNet  MATH  Article  Google Scholar 

  16. 16.

    Chan, S.O., Papailiopoulos, D., Rubinstein, A.: On the worst-case approximability of sparse PCA. arXiv preprint arXiv:1507.05950 (2015)

  17. 17.

    Chen, Y., Jalali, A., Sanghavi, S., Xu, H.: Clustering partially observed graphs via convex optimization. J. Mach. Learn. Res. 15(1), 2213–2238 (2014)

    MathSciNet  MATH  Google Scholar 

  18. 18.

    Computing, J.: Julia micro-benchmarks (2018). https://julialang.org/benchmarks/

  19. 19.

    d’Aspremont, A., Bach, F., Ghaoui, L.E.: Optimal solutions for sparse principal component analysis. J. Mach. Learn. Res. 9, 1269–1294 (2008)

    MathSciNet  MATH  Google Scholar 

  20. 20.

    d’Aspremont, A., El Ghaoui, L., Jordan, M.I., Lanckriet, G.R.: A direct formulation for sparse PCA using semidefinite programming. SIAM Rev. 49(3), 434–448 (2007)

    MathSciNet  MATH  Article  Google Scholar 

  21. 21.

    Deluzio, K., Astephen, J.: Biomechanical features of gait waveform data associated with knee osteoarthritis: an application of principal component analysis. Gait Posture 25(1), 86–93 (2007)

    Article  Google Scholar 

  22. 22.

    Ding, C., He, X.: K-means clustering via principal component analysis. In: Proceedings of the twenty-first international conference on Machine learning, Banff, Alberta, Canada, 04–08 July 2004, p. 29. ACM, New York (2004). https://doi.org/10.1145/1015330.1015408

  23. 23.

    Du, Q., Fowler, J.E.: Hyperspectral image compression using jpeg2000 and principal component analysis. IEEE Geosci. Remote Sens. Lett. 4(2), 201–205 (2007)

    Article  Google Scholar 

  24. 24.

    Dunning, I., Huchette, J., Lubin, M.: JuMP: a modeling language for mathematical optimization. SIAM Rev. 59(2), 295–320 (2017). https://doi.org/10.1137/15M1020575

    MathSciNet  MATH  Article  Google Scholar 

  25. 25.

    Gurobi Optimization Inc.: Gurobi 7.0 performance benchmarks. http://www.gurobi.com/pdfs/benchmarks.pdf (2015). Accessed 17 Dec 2016

  26. 26.

    Gurobi Optimization Inc.: Gurobi optimizer reference manual (2017). http://www.gurobi.com

  27. 27.

    Hand, D.J., Daly, F., McConway, K., Lunn, D., Ostrowski, E.: A Handbook of Small Data Sets, vol. 1. CRC Press, Boca Raton (1993)

    Google Scholar 

  28. 28.

    Hastie, T., Tibshirani, R., Wainwright, M.: Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, Boca Raton (2015)

    Google Scholar 

  29. 29.

    Hein, M., Bühler, T.: An inverse power method for nonlinear eigenproblems with applications in 1-spectral clustering and sparse PCA. In: Advances in Neural Information Processing Systems, pp. 847–855 (2010)

  30. 30.

    Hotelling, H.: Relations between two sets of variates. Biometrika 28(3/4), 321–377 (1936)

    MATH  Article  Google Scholar 

  31. 31.

    Hsu, Y.L., Huang, P.Y., Chen, D.T.: Sparse principal component analysis in cancer research. Transl. Cancer Res. 3(3), 182 (2014)

    Google Scholar 

  32. 32.

    IBM: IBM ILOG CPLEX User’s manual (2017). https://www-01.ibm.com/software/commerce/optimization/cplex-optimizer/

  33. 33.

    Iezzoni, A.F., Pritts, M.P.: Applications of principal component analysis to horticultural research. HortScience 26(4), 334–338 (1991)

    Article  Google Scholar 

  34. 34.

    Iguchi, T., Mixon, D.G., Peterson, J., Villar, S.: Probably certifiably correct k-means clustering. Math. Program. 165(2), 605–642 (2017)

    MathSciNet  MATH  Article  Google Scholar 

  35. 35.

    Jeffers, J.N.: Two case studies in the application of principal component analysis. Appl. Stat. 16(3), 225–236 (1967)

    Article  Google Scholar 

  36. 36.

    Jolliffe, I.T.: Rotation of principal components: choice of normalization constraints. J. Appl. Stat. 22(1), 29–35 (1995)

    MathSciNet  Article  Google Scholar 

  37. 37.

    Jolliffe, I.T.: Principal Component Analysis. Wiley, London (2002)

    Google Scholar 

  38. 38.

    Jolliffe, I.T., Trendafilov, N.T., Uddin, M.: A modified principal component technique based on the LASSO. J. Comput. Graph. Stat. 12(3), 531–547 (2003)

    MathSciNet  Article  Google Scholar 

  39. 39.

    Journée, M., Nesterov, Y., Richtárik, P., Sepulchre, R.: Generalized power method for sparse principal component analysis. J. Mach. Learn. Res. 11, 517–553 (2010)

    MathSciNet  MATH  Google Scholar 

  40. 40.

    Kaiser, H.F.: The varimax criterion for analytic rotation in factor analysis. Psychometrika 23(3), 187–200 (1958)

    MATH  Article  Google Scholar 

  41. 41.

    Kumar, V., Kanal, L.N.: Parallel branch-and-bound formulations for and/or tree search. IEEE Trans. Pattern Anal. Mach. Intell. 42(6), 768–778 (1984)

    Article  Google Scholar 

  42. 42.

    Labib, K., Vemuri, V.R.: An application of principal component analysis to the detection and visualization of computer network attacks. Annales des Telecommunications/Ann. Telecommun. 61(1–2), 218–234 (2006)

    Article  Google Scholar 

  43. 43.

    Land, A.H., Doig, A.G.: An automatic method of solving discrete programming problems. Econometrica 28, 497–520 (1960)

    MathSciNet  MATH  Article  Google Scholar 

  44. 44.

    Lee, S., Epstein, M.P., Duncan, R., Lin, X.: Sparse principal component analysis for identifying ancestry-informative markers in genome-wide association studies. Genet. Epidemiol. 36(4), 293–302 (2012)

    Article  Google Scholar 

  45. 45.

    Lee, Y.K., Lee, E.R., Park, B.U.: Principal component analysis in very high-dimensional spaces. Stat. Sin. 22(1), 933–956 (2012)

    MathSciNet  MATH  Google Scholar 

  46. 46.

    Leng, C., Wang, H.: On general adaptive sparse principal component analysis. J. Comput. Graph. Stat. 18(1), 201–215 (2009)

    MathSciNet  Article  Google Scholar 

  47. 47.

    Li, G.J., Wah, B.W.: Coping with anomalies in parallel branch-and-bound algorithms. IEEE Trans. Comput. 100(6), 568–573 (1986)

    MathSciNet  Article  Google Scholar 

  48. 48.

    Lichman, M.: UCI machine learning repository (2013). http://archive.ics.uci.edu/ml

  49. 49.

    Lougee-Heimer, R.: The common optimization interface for operations research. IBM J. Res. Dev. 47(1), 57–66 (2003)

    Article  Google Scholar 

  50. 50.

    Luss, R., Teboulle, M.: Conditional gradient algorithms for rank-one matrix approximations with a sparsity constraint. SIAM Rev. 55(1), 65–98 (2013)

    MathSciNet  MATH  Article  Google Scholar 

  51. 51.

    Ma, Z., et al.: Sparse principal component analysis and iterative thresholding. Ann. Stat. 41(2), 772–801 (2013)

    MathSciNet  MATH  Article  Google Scholar 

  52. 52.

    Mangasarian, O.L.: Exact 1-norm support vector machines via unconstrained convex differentiable minimization. J. Mach. Learn. Res. 7, 1517–1530 (2006)

    MathSciNet  MATH  Google Scholar 

  53. 53.

    Mazumder, R., Radchenko, P., Dedieu, A.: Subset selection with shrinkage: sparse linear modeling when the snr is low. arXiv preprint arXiv:1708.03288 (2017)

  54. 54.

    Moghaddam, B., Weiss, Y., Avidan, S.: Spectral bounds for sparse PCA: Exact and greedy algorithms. In: Advances in Neural Information Processing Systems, pp. 915–922 (2005)

  55. 55.

    Nemhauser, G.L.: Integer Programming: the Global Impact. Presented at EURO, INFORMS, Rome, Italy, 2013. http://euro-informs2013.org/data/http_/euro2013.org/wp-content/uploads/nemhauser.pdf (2013). Accessed 9 Sept 2015

  56. 56.

    Papailiopoulos, D.S., Dimakis, A.G., Korokythakis, S.: Sparse PCA through low-rank approximations. ICML 3, 747–755 (2013)

    Google Scholar 

  57. 57.

    Platt, J.C.: Fast training of support vector machines using sequential minimal optimization. In: Advances in Kernel Methods: Support Vector Learning, pp. 185–208. MIT Press, Cambridge (1999)

  58. 58.

    Price, A.L., Patterson, N.J., Plenge, R.M., Weinblatt, M.E., Shadick, N.A., Reich, D.: Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38(8), 904–909 (2006)

    Article  Google Scholar 

  59. 59.

    Richman, M.B.: Rotation of principal components. J. Climatol. 6(3), 293–335 (1986)

    MathSciNet  Article  Google Scholar 

  60. 60.

    Richtárik, P., Takáč, M., Ahipaşaoğlu, S.D.: Alternating maximization: unifying framework for 8 sparse PCA formulations and efficient parallel codes. arXiv preprint arXiv:1212.4137 (2012)

  61. 61.

    Scott, D.S.: On the accuracy of the Gerschgorin circle theorem for bounding the spread of a real symmetric matrix. Linear Algebra Appl. 65, 147–155 (1985)

    MathSciNet  MATH  Article  Google Scholar 

  62. 62.

    Snoek, J., Larochelle, H., Adams, R.P.: Practical Bayesian optimization of machine learning algorithms. Adv. Neural Inf. Process. Syst. 25, 2960–2968 (2012)

    Google Scholar 

  63. 63.

    Sra, S., Nowozin, S., Wright, S.J.: Optimization for Machine Learning. MIT Press, Cambridge (2012)

    Google Scholar 

  64. 64.

    Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B (Methodol.) 58(1), 267–288 (1996)

    MathSciNet  MATH  Google Scholar 

  65. 65.

    Top500 Supercomputer Sites: performance development. http://www.top500.org/statistics/perfdevel/ (2016). Accessed 17 Dec 2016

  66. 66.

    Wilkinson, J.H.: The Algebraic Eigenvalue Problem, vol. 87. Clarendon Press, Oxford (1965)

    Google Scholar 

  67. 67.

    Witten, D., Tibshirani, R., Hastie, T.: A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 10(3), 515–534 (2009)

    Article  Google Scholar 

  68. 68.

    Witten, D.M., Tibshirani, R.J.: Extensions of sparse canonical correlation analysis with applications to genomic data. Stat. Appl. Genet. Mol. Biol. 8(1), 1–27 (2009)

    MathSciNet  MATH  Article  Google Scholar 

  69. 69.

    Yanover, C., Meltzer, T., Weiss, Y.: Linear programming relaxations and belief propagation—an empirical study. J. Mach. Learn. Res. 7, 1887–1907 (2006)

    MathSciNet  MATH  Google Scholar 

  70. 70.

    Yuan, X.T., Zhang, T.: Truncated power method for sparse eigenvalue problems. J. Mach. Learn. Res. 14, 899–925 (2013)

    MathSciNet  MATH  Google Scholar 

  71. 71.

    Zeng, Z.Q., Yu, H.B., Xu, H.R., Xie, Y.Q., Gao, J.: Fast training support vector machines using parallel sequential minimal optimization. In: 3rd International Conference on Intelligent System and Knowledge Engineering, 2008, vol. 1, pp. 997–1001. ISKE 2008. IEEE (2008)

  72. 72.

    Zhang, Y., Ghaoui, L.E.: Large-scale sparse principal component analysis with application to text data. In: Advances in Neural Information Processing Systems, vol. 24, pp. 532–539 (2011)

  73. 73.

    Zou, H., Hastie, T., Tibshirani, R.: Sparse principal component analysis. J. Comput. Graph. Stat. 15(2), 265–286 (2006)

    MathSciNet  Article  Google Scholar 

Download references

Author information



Corresponding author

Correspondence to Dimitris Bertsimas.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

A Overview of the Optimal-SPCA implementation in Julia

A Overview of the Optimal-SPCA implementation in Julia

The linked repository contains an implementation of Optimal-SPCA written in Julia 0.6.0. The latest version of this software is available on GitHub at https://github.com/lauren897/Optimal-SPCA. The Algorithm directory contains the Julia files that comprise the algorithm, and the Data directory contains an example dataset.

In order to run this software, you must install a recent version of Julia from http://julialang.org/downloads/. The most recent version of Julia at the time this code was last tested before publication was Julia 0.6.0.

Two packages must be installed in Julia before the code can be run. These packages are DataFrames, and StatsBase. They can be added by running Pkg.add(“DataFrames”) and Pkg.add(“StatsBase”) respectively.

At this point, the file test.jl should run successfully. To run the script, navigate to the Algorithm directory, and run include(“test.jl”). The script will run Optimal-SPCA on the Pitprops dataset, and then generate an additional random problem and run the algorithm on that problem. It will then identify the first few sparse principal components using Optimal-SPCA sequentially and reporting the cumulative variance explained.

The key method used in the algorithm is is branchAndBound. It takes two required arguments: prob, and k. The variable prob uses a custom type that holds the original data as well as the covariance matrix associated with the problem. (If data is not available, the Cholesky factorization of the covariance matrix will suffice.) The data is presented in an \(m \times n\) array, with \(m\) data points in \(n\) dimensions. The corresponding covariance matrix is \(n \times n\). The parameter k is a positive integer less than \(n\) and represents the desired sparsity.

By default, branchAndBound solves the problem and returns the objective function value, solution vector, and a few performance metrics, including time elapsed and the number of nodes explored. There are many optional parameters, some of which are discussed in detail in our paper. Other parameters have to do with technical aspects of the algorithm, like convergence criteria and resizing arrays. These are commented on in detail in the branchAndBound.jl file where the function is defined.

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Berk, L., Bertsimas, D. Certifiably optimal sparse principal component analysis. Math. Prog. Comp. 11, 381–420 (2019). https://doi.org/10.1007/s12532-018-0153-6

Download citation


  • Sparse principal component analysis
  • Principal component analysis
  • Mixed integer optimization
  • Sparse eigenvalues

Mathematics Subject Classification

  • 62H25
  • 65F15
  • 65K05
  • 90C06
  • 90C26
  • 90C27