
A generalized information criterion for high-dimensional PCA rank selection

  • Regular Article
  • Published in Statistical Papers

Abstract

Principal component analysis (PCA) is a commonly used statistical tool for dimension reduction. An important issue in PCA is to determine the rank, which is the number of dominant eigenvalues of the covariance matrix. Among information-based criteria, the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) are the two most common ones. Both use the number of free parameters for assessing model complexity, which requires the validity of the simple spiked covariance model. As a result, AIC and BIC may suffer from the problem of model misspecification when the tail eigenvalues do not follow the simple spiked model assumption. To alleviate this difficulty, we adopt the idea of the generalized information criterion (GIC) to propose a model complexity measure for PCA rank selection. The proposed model complexity takes into account the sizes of eigenvalues and, hence, is more robust to model misspecification. Asymptotic properties of our GIC are established under the high-dimensional setting, where \(n\rightarrow \infty \) and \(p/n\rightarrow c >0\). Our asymptotic results show that GIC is better than AIC in excluding noise eigenvalues, and is more sensitive than BIC in detecting signal eigenvalues. Numerical studies and a real data example are presented.
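As a generic illustration of eigenvalue-based rank selection under the spiked covariance model (a minimal sketch using the classical AIC of Wax and Kailath (1985), not the GIC proposed in this paper; the simulation settings n, p, and the spike sizes are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate data from a simple spiked covariance model: three signal
# eigenvalues (20, 10, 5) above a unit noise floor (values are hypothetical).
n, p, true_rank = 500, 50, 3
eigvals_true = np.concatenate([np.array([20.0, 10.0, 5.0]),
                               np.ones(p - true_rank)])
# Each column j of X has variance eigvals_true[j], so the population
# covariance is diag(eigvals_true).
X = rng.standard_normal((n, p)) * np.sqrt(eigvals_true)
lam = np.sort(np.linalg.eigvalsh(X.T @ X / n))[::-1]  # sample eigenvalues, descending

def aic_rank(lam, n, p, k):
    """Eigenvalue-based AIC of Wax & Kailath (1985) for k signal components."""
    tail = lam[k:]
    m = tail.size
    # log of (geometric mean / arithmetic mean) of the tail eigenvalues;
    # this is the profile log-likelihood-ratio term for "m equal tail eigenvalues"
    log_ratio = np.mean(np.log(tail)) - np.log(np.mean(tail))
    return -2.0 * n * m * log_ratio + 2.0 * k * (2 * p - k)

scores = [aic_rank(lam, n, p, k) for k in range(p - 1)]
k_hat = int(np.argmin(scores))
print("selected rank:", k_hat)
```

In the high-dimensional regime described in the abstract (p/n → c > 0) the noise eigenvalues spread out rather than clustering at a single value, which is why a criterion of this form can retain spurious components; the proposed GIC addresses this by letting the complexity penalty depend on the eigenvalue sizes themselves.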




Funding

Funding was provided by the Ministry of Science and Technology, Taiwan (MOST 110-2118-M-002-001-MY3 for HH; MOST 107-2118-M-001-012-MY3 for SYH).

Author information

Correspondence to Hung Hung.


Supplementary Information

Supplementary material 1 (PDF 1979 KB)


Cite this article

Hung, H., Huang, SY. & Ing, CK. A generalized information criterion for high-dimensional PCA rank selection. Stat Papers 63, 1295–1321 (2022). https://doi.org/10.1007/s00362-021-01276-7

