Abstract
Principal component analysis (PCA) is a widely used statistical tool for dimension reduction. An important issue in PCA is determining the rank, i.e., the number of dominant eigenvalues of the covariance matrix. Among information-based criteria, the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) are the two most common. Both use the number of free parameters to assess model complexity, which requires the validity of the simple spiked covariance model. Consequently, AIC and BIC may suffer from model misspecification when the tail eigenvalues do not follow the simple spiked model assumption. To alleviate this difficulty, we adopt the idea of the generalized information criterion (GIC) to propose a model complexity measure for PCA rank selection. The proposed measure takes the sizes of the eigenvalues into account and is therefore more robust to model misspecification. Asymptotic properties of our GIC are established under the high-dimensional setting, where \(n\rightarrow \infty \) and \(p/n\rightarrow c >0\). Our asymptotic results show that GIC is better than AIC at excluding noise eigenvalues, and is more sensitive than BIC in detecting signal eigenvalues. Numerical studies and a real data example are presented.
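To make the rank-selection setting concrete, the following is a minimal sketch of information-criterion rank selection for PCA in the classical Wax–Kailath style, where the fit term compares the geometric and arithmetic means of the tail eigenvalues and the penalty counts the free parameters of the rank-\(k\) spiked model. It is not the paper's GIC: the function name `select_rank`, the parameter count `d`, and the AIC/BIC penalty constants are illustrative assumptions, shown only to fix ideas about how AIC- and BIC-type criteria trade off fit against the number of free parameters.

```python
import numpy as np

def select_rank(X, penalty="bic"):
    """Select a PCA rank by an AIC/BIC-style criterion (illustrative sketch,
    Wax-Kailath form; NOT the GIC proposed in the paper)."""
    n, p = X.shape
    S = np.cov(X, rowvar=False)                   # p x p sample covariance
    lam = np.sort(np.linalg.eigvalsh(S))[::-1]    # eigenvalues, descending
    lam = np.maximum(lam, 1e-12)                  # guard tiny negative values
    crits = []
    for k in range(p):                            # candidate ranks 0..p-1
        tail = lam[k:]                            # the p-k smallest eigenvalues
        log_g = np.log(tail).mean()               # log geometric mean of tail
        log_a = np.log(tail.mean())               # log arithmetic mean of tail
        # -2 log-likelihood of the tail under isotropic noise (>= 0 by AM-GM)
        neg2ll = -2.0 * n * (p - k) * (log_g - log_a)
        # free parameters of the rank-k spiked model (up to an additive constant)
        d = k * (2 * p - k + 1) / 2
        pen = np.log(n) if penalty == "bic" else 2.0
        crits.append(neg2ll + pen * d)
    return int(np.argmin(crits))
```

Because the penalty grows with \(k\) while the fit term shrinks, a larger penalty constant (BIC's \(\log n\) versus AIC's 2) can only select a rank that is no larger, which mirrors the abstract's point that AIC is less conservative than BIC.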
Funding
Funding was provided by the Ministry of Science and Technology, Taiwan (110-2118-M-002-001-MY3 for HH; MOST 107-2118-M-001-012-MY3 for SYH).
Hung, H., Huang, SY. & Ing, CK. A generalized information criterion for high-dimensional PCA rank selection. Stat Papers 63, 1295–1321 (2022). https://doi.org/10.1007/s00362-021-01276-7