An empirical comparison of two approaches for CDPCA in high-dimensional data

Freitas, Adelaide; Macedo, Eloísa; Vichi, Maurizio

doi:10.1007/s10260-020-00546-2

Original Paper
Published: 18 August 2020

Volume 30, pages 1007–1031, (2021)

Statistical Methods & Applications

Abstract

Modified principal component analysis (PCA) techniques, especially those yielding sparse solutions, are attractive because they are easier to interpret, particularly in high-dimensional data sets. Clustering and disjoint principal component analysis (CDPCA) is a constrained PCA that promotes sparsity in the loadings matrix. Specifically, CDPCA seeks to describe the data in terms of disjoint (and possibly sparse) components while simultaneously identifying clusters of objects. Using simulated and real gene expression data sets in which the number of variables exceeds the number of objects, we empirically compare the performance of two heuristic iterative procedures proposed in the specialized literature to perform CDPCA: the alternating least-squares (ALS) algorithm and the two-step-SDP algorithm. To avoid possible effects of differing variances among the original variables, all the data were standardized. Although both procedures perform well, numerical tests highlight two main features that distinguish the two-step-SDP algorithm: it is faster than ALS and, since it applies a clustering procedure (k-means) to the variables, it outperforms ALS in recovering the true variable partition underlying the generated data sets. Overall, both procedures produce satisfactory results in terms of solution precision, where ALS performs better, and in recovering the true object clusters, where two-step-SDP outperforms ALS for data sets with a smaller sample size and greater structural complexity (i.e., a higher error level in the CDPCA model). The proportion of variance explained by the components estimated by both algorithms is affected by the structural complexity of the data (the higher the error level, the lower the explained variance) and is similar for the two algorithms, except for data sets with two object clusters, where the two-step-SDP approach yields higher explained variance. Moreover, experimental tests suggest that the two-step-SDP approach is, in general, better able to recover the true number of object clusters, while the ALS algorithm gives better object clustering quality, with more homogeneous, compact and well-separated clusters in the reduced space of the CDPCA components.
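
To make the notions of standardization, disjoint components and proportion of explained variance concrete, the following is a minimal Python sketch. It is not the ALS or the two-step-SDP algorithm: it assumes the partition of the variables is already known (the hypothetical array `var_labels`), builds one component per block of variables from the leading eigenvector of that block, and omits the object clustering step entirely. The function names and the toy data are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def standardize(X):
    """Center each column and scale it to unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)


def disjoint_loadings(Z, var_labels, n_components):
    """Build a J x Q loading matrix where each variable loads on one component.

    Assumes every component label 0..Q-1 is assigned to at least one variable.
    """
    n_objects, n_vars = Z.shape
    A = np.zeros((n_vars, n_components))
    for q in range(n_components):
        idx = np.flatnonzero(var_labels == q)
        block = Z[:, idx]
        # Covariance of the (already centered) variables in block q
        S_q = block.T @ block / n_objects
        # eigh returns eigenvalues in ascending order: last column is the leading one
        _, eigvecs = np.linalg.eigh(S_q)
        A[idx, q] = eigvecs[:, -1]
    return A


def explained_variance_ratio(Z, A):
    """Share of the total variance captured by the component scores Z @ A."""
    scores = Z @ A
    return scores.var(axis=0).sum() / Z.var(axis=0).sum()


# Toy data (illustrative only): 20 objects, 50 variables, two blocks of variables
rng = np.random.default_rng(0)
n_objects, n_vars = 20, 50
var_labels = np.repeat([0, 1], 25)           # assumed known variable partition
factors = rng.normal(size=(n_objects, 2))    # one latent factor per block
X = factors[:, var_labels] + 0.3 * rng.normal(size=(n_objects, n_vars))

Z = standardize(X)
A = disjoint_loadings(Z, var_labels, n_components=2)
ratio = explained_variance_ratio(Z, A)
print(f"proportion of explained variance: {ratio:.3f}")
```

Because the columns of `A` have non-overlapping supports and unit norm, they are orthonormal by construction, which is what makes the components disjoint. In the procedures compared in the paper the variable partition is estimated rather than given, so this sketch only illustrates the structure of the solution, not how it is found.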

Acknowledgements

The authors are grateful to the reviewers for their comments and suggestions that helped to greatly improve the quality of this paper. Special thanks to Giorgia Zaccaria for her valuable comments on the pseudo-F statistic. This work is supported by the Center for Research and Development in Mathematics and Applications (CIDMA) through the Portuguese Foundation for Science and Technology (FCT—Fundação para a Ciência e a Tecnologia) references UIDB/04106/2020 and UIDP/04106/2020, and by the projects UIDB/00481/2020 and UIDP/00481/2020—FCT—and CENTRO-01-0145-FEDER-022083—Centro Portugal Regional Operational Programme (Centro2020), under the PORTUGAL 2020 Partnership Agreement, through the European Regional Development Fund.

Author information

Authors and Affiliations

Department of Mathematics, University of Aveiro, 3810-193, Aveiro, Portugal
Adelaide Freitas
CIDMA—Center for Research and Development in Mathematics and Applications, University of Aveiro, 3810-193, Aveiro, Portugal
Adelaide Freitas
TEMA—Center for Mechanical Technology and Automation, University of Aveiro, 3810-193, Aveiro, Portugal
Eloísa Macedo
Department of Statistical Sciences, University “La Sapienza”, P.le A. Moro 5, 00185, Rome, Italy
Maurizio Vichi

Corresponding author

Correspondence to Adelaide Freitas.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 22 KB)

Supplementary material 2 (pdf 148 KB)

About this article

Cite this article

Freitas, A., Macedo, E. & Vichi, M. An empirical comparison of two approaches for CDPCA in high-dimensional data. Stat Methods Appl 30, 1007–1031 (2021). https://doi.org/10.1007/s10260-020-00546-2

Accepted: 07 August 2020
Published: 18 August 2020
Issue Date: September 2021
DOI: https://doi.org/10.1007/s10260-020-00546-2
