Clustering and feature selection using sparse principal component analysis

Published in Optimization and Engineering

Abstract

In this paper, we study the application of sparse principal component analysis (PCA) to clustering and feature selection problems. Sparse PCA seeks sparse factors, i.e. linear combinations of the data variables that explain a maximum amount of variance while having only a limited number of nonzero coefficients. PCA is often used as a simple clustering technique, and sparse factors here allow the clusters to be interpreted in terms of a reduced set of variables. We begin with a brief introduction to and motivation for sparse PCA and detail our implementation of the algorithm in d’Aspremont et al. (SIAM Rev. 49(3):434–448, 2007). We then apply these results to some classic clustering and feature selection problems arising in biology.
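As an illustration of what sparse PCA computes, the sketch below extracts a leading sparse factor by truncated power iteration, keeping only the k largest-magnitude loadings at each step. This is a simple greedy heuristic, not the semidefinite programming algorithm of d’Aspremont et al. (2007) implemented in the paper; the function name and the synthetic two-block data are hypothetical.

```python
import numpy as np

def sparse_pc(cov, k, n_iter=200, seed=0):
    """Leading sparse principal component of a covariance matrix,
    computed by power iteration truncated to k nonzero loadings."""
    rng = np.random.default_rng(seed)
    p = cov.shape[0]
    x = rng.standard_normal(p)
    x /= np.linalg.norm(x)
    for _ in range(n_iter):
        y = cov @ x                          # power-method step
        small = np.argsort(np.abs(y))[:-k]   # indices of the p-k smallest entries
        y[small] = 0.0                       # enforce cardinality <= k
        x = y / np.linalg.norm(y)
    return x

# Synthetic data with two blocks of highly correlated variables:
# a sparse factor should load on the variables of a single block.
rng = np.random.default_rng(1)
z1, z2 = rng.standard_normal(500), rng.standard_normal(500)
cols = [z1 + 0.1 * rng.standard_normal(500) for _ in range(3)]
cols += [z2 + 0.1 * rng.standard_normal(500) for _ in range(3)]
data = np.column_stack(cols)

cov = np.cov(data, rowvar=False)
v = sparse_pc(cov, k=3)
print(np.flatnonzero(v))  # support of the sparse factor: one block of variables
```

The recovered support identifies one cluster of variables directly, which is the interpretability advantage over dense PCA loadings discussed in the abstract.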


References

  • Alizadeh A, Eisen M, Davis R, Ma C, Lossos I, Rosenwald A (2000) Distinct types of diffuse large b-cell lymphoma identified by gene expression profiling. Nature 403:503–511

  • Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ (1999) Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci USA 96:6745–6750

  • Cadima J, Jolliffe IT (1995) Loadings and correlations in the interpretation of principal components. J Appl Stat 22:203–214

  • Candès EJ, Tao T (2005) Decoding by linear programming. IEEE Trans Inf Theory 51(12):4203–4215

  • d’Aspremont A (2005) Smooth optimization with approximate gradient. arXiv:math.OC/0512344

  • d’Aspremont A, El Ghaoui L, Jordan MI, Lanckriet GRG (2007) A direct formulation for sparse PCA using semidefinite programming. SIAM Rev 49(3):434–448

  • Donoho DL, Tanner J (2005) Sparse nonnegative solutions of underdetermined linear equations by linear programming. Proc Natl Acad Sci 102(27):9446–9451

  • Guyon I, Weston J, Barnhill S, Vapnik V (2002) Gene selection for cancer classification using support vector machines. Mach Learn 46:389–422

  • Huang TM, Kecman V (2005) Gene extraction for cancer diagnosis by support vector machines-an improvement. Artif Intell Med 35:185–194

  • Jolliffe IT, Trendafilov NT, Uddin M (2003) A modified principal component technique based on the LASSO. J Comput Graph Stat 12:531–547

  • Moler C, Van Loan C (2003) Nineteen dubious ways to compute the exponential of a matrix, twenty-five years later. SIAM Rev 45(1):3–49

  • Moghaddam B, Weiss Y, Avidan S (2006a) Generalized spectral bounds for sparse LDA. In: International conference on machine learning

  • Moghaddam B, Weiss Y, Avidan S (2006b) Spectral bounds for sparse PCA: exact and greedy algorithms. Adv Neural Inf Process Syst 18

  • Nesterov Y (1983) A method of solving a convex programming problem with convergence rate O(1/k²). Sov Math Dokl 27(2):372–376

  • Nesterov Y (2005) Smooth minimization of non-smooth functions. Math Program 103(1):127–152

  • Pataki G (1998) On the rank of extreme matrices in semidefinite programs and the multiplicity of optimal eigenvalues. Math Oper Res 23(2):339–358

  • Su Y, Murali TM, Pavlovic V, Schaffer M, Kasif S (2003) Rankgene: identification of diagnostic genes based on expression data. Bioinformatics 19:1578–1579

  • Srebro N, Shakhnarovich G, Roweis S (2006) An investigation of computational and informational limits in Gaussian mixture clustering. In: Proceedings of the 23rd international conference on machine learning, pp 865–872

  • Sturm J (1999) Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optim Methods Softw 11:625–653

  • Tibshirani R (1996) Regression shrinkage and selection via the LASSO. J R Stat Soc Ser B 58(1):267–288

  • Vapnik V (1995) The nature of statistical learning theory. Springer, Berlin

  • Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Stat Soc Ser B Stat Methodol 67(2):301–320

  • Zou H, Hastie T, Tibshirani R (2006) Sparse principal component analysis. J Comput Graph Stat 15(2):265–286

  • Zhang Z, Zha H, Simon H (2002) Low rank approximations with sparse factors I: basic algorithms and error analysis. SIAM J Matrix Anal Appl 23(3):706–727

  • Zhang Z, Zha H, Simon H (2004) Low rank approximations with sparse factors II: penalized methods with discrete Newton-like iterations. SIAM J Matrix Anal Appl 25(4):901–920

Author information

Correspondence to Alexandre d’Aspremont.

Cite this article

Luss, R., d’Aspremont, A. Clustering and feature selection using sparse principal component analysis. Optim Eng 11, 145–157 (2010). https://doi.org/10.1007/s11081-008-9057-z
