Abstract
Modified principal component analysis techniques, especially those yielding sparse solutions, are attractive because they ease interpretation, particularly in high-dimensional data sets. Clustering and disjoint principal component analysis (CDPCA) is a constrained PCA that promotes sparsity in the loadings matrix. Specifically, CDPCA seeks to describe the data in terms of disjoint (and possibly sparse) components while simultaneously identifying clusters of objects. Using simulated and real gene expression data sets in which the number of variables exceeds the number of objects, we empirically compare the performance of two heuristic iterative procedures proposed in the specialized literature to perform CDPCA, namely the ALS and two-step-SDP algorithms. To avoid possible effects of differing variances among the original variables, all data were standardized. Although both procedures perform well, numerical tests highlight two main features that distinguish the two-step-SDP algorithm: it is faster than ALS and, since it employs a clustering procedure (k-means) on the variables, it outperforms ALS in recovering the true variable partition underlying the generated data sets. Overall, both procedures produce satisfactory results in terms of solution precision, where ALS performs better, and in recovering the true object clusters, where two-step-SDP outperforms ALS for data sets with smaller sample size and greater structural complexity (i.e., error level in the CDPCA model). The proportion of variance explained by the components estimated by both algorithms is affected by the complexity of the data structure (the higher the error level, the lower the explained variance) and is similar for the two algorithms, except for data sets with two object clusters, where the two-step-SDP approach yields higher explained variance.
Moreover, experimental tests suggest that the two-step-SDP approach is, in general, better able to recover the true number of object clusters, while the ALS algorithm yields better object clustering quality, with more homogeneous, compact and well-separated clusters in the reduced space of the CDPCA components.
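To make the notions of standardization, disjoint components and proportion of explained variance concrete, the following is a minimal illustrative sketch in Python with NumPy. It is not the ALS or the two-step-SDP algorithm compared in the paper: the variable partition `var_labels` is assumed to be given (the two-step-SDP approach would obtain it with k-means on the variables), the object clustering step is omitted, and the function names and toy data are hypothetical.

```python
import numpy as np


def standardize(X):
    """Center each column and scale it to unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)


def disjoint_loadings(X, var_labels, n_components):
    """Build a loadings matrix in which each variable loads on one component.

    For each group of variables, the nonzero loadings are the leading
    eigenvector of that group's covariance block; all other entries are zero.
    Assumes X is column-centered and every group contains at least one variable.
    """
    n, p = X.shape
    S = X.T @ X / n  # covariance matrix of the centered data
    A = np.zeros((p, n_components))
    for q in range(n_components):
        idx = np.flatnonzero(var_labels == q)
        _, vecs = np.linalg.eigh(S[np.ix_(idx, idx)])
        A[idx, q] = vecs[:, -1]  # eigh sorts eigenvalues in ascending order
    return A


def explained_variance_ratio(X, A):
    """Share of the total variance of X captured by the component scores X @ A."""
    Y = X @ A
    return Y.var(axis=0).sum() / X.var(axis=0).sum()


# Toy data (assumption): 20 objects, 6 variables driven by 2 latent factors.
rng = np.random.default_rng(0)
factors = rng.normal(size=(20, 2))
X_raw = np.repeat(factors, 3, axis=1) + 0.3 * rng.normal(size=(20, 6))

X = standardize(X_raw)
var_labels = np.array([0, 0, 0, 1, 1, 1])  # assumed known variable partition

A = disjoint_loadings(X, var_labels, n_components=2)
ratio = explained_variance_ratio(X, A)

print(np.round(A, 3))
print("proportion of explained variance:", round(float(ratio), 3))
```

Because the columns of `A` have non-overlapping supports and unit norm, they are orthonormal by construction, which is the structural constraint that makes the components disjoint and easier to interpret.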
Acknowledgements
The authors are grateful to the reviewers for their comments and suggestions that helped to greatly improve the quality of this paper. Special thanks to Giorgia Zaccaria for her valuable comments on the pseudo-F statistic. This work is supported by the Center for Research and Development in Mathematics and Applications (CIDMA) through the Portuguese Foundation for Science and Technology (FCT—Fundação para a Ciência e a Tecnologia) references UIDB/04106/2020 and UIDP/04106/2020, and by the projects UIDB/00481/2020 and UIDP/00481/2020—FCT—and CENTRO-01-0145-FEDER-022083—Centro Portugal Regional Operational Programme (Centro2020), under the PORTUGAL 2020 Partnership Agreement, through the European Regional Development Fund.
Cite this article
Freitas, A., Macedo, E. & Vichi, M. An empirical comparison of two approaches for CDPCA in high-dimensional data. Stat Methods Appl 30, 1007–1031 (2021). https://doi.org/10.1007/s10260-020-00546-2
DOI: https://doi.org/10.1007/s10260-020-00546-2