Sparse estimation via nonconcave penalized likelihood in factor analysis model

Abstract

We consider the problem of sparse estimation in a factor analysis model. A traditional estimation procedure in use is the following two-step approach: the model is estimated by maximum likelihood method and then a rotation technique is utilized to find sparse factor loadings. However, the maximum likelihood estimates cannot be obtained when the number of variables is much larger than the number of observations. Furthermore, even if the maximum likelihood estimates are available, the rotation technique does not often produce a sufficiently sparse solution. In order to handle these problems, this paper introduces a penalized likelihood procedure that imposes a nonconvex penalty on the factor loadings. We show that the penalized likelihood procedure can be viewed as a generalization of the traditional two-step approach, and the proposed methodology can produce sparser solutions than the rotation technique. A new algorithm via the EM algorithm along with coordinate descent is introduced to compute the entire solution path, which permits the application to a wide variety of convex and nonconvex penalties. Monte Carlo simulations are conducted to investigate the performance of our modeling strategy. A real data example is also given to illustrate our procedure.


Acknowledgments

The authors would like to thank anonymous reviewers for the constructive and helpful comments that improved the quality of the paper considerably. We also thank Professor Yutaka Kano for the helpful discussions.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Kei Hirose.

Appendices

Appendix A: Derivation of complete-data penalized log-likelihood function in EM algorithm

In order to apply the EM algorithm, we regard the common factors \(\varvec{f}_n\) as missing data and maximize the complete-data penalized log-likelihood function

$$\begin{aligned} {l}_{\rho }^{C} ({\varvec{\varLambda }},\varvec{\varPsi }) = \sum _{n=1}^N \log f(\varvec{x}_n,\varvec{f}_n) - N \sum _{i=1}^p\sum _{j=1}^m \rho P(|\lambda _{ij}|), \end{aligned}$$

where the density function \(f(\varvec{x}_n,\varvec{f}_n)\) is defined by

$$\begin{aligned} f(\varvec{x}_n,\varvec{f}_n)&= \prod _{i=1}^p \left\{ (2\pi \psi _i)^{-1/2} \exp \left( - \frac{ (x_{ni}-\varvec{\lambda }_i^T\varvec{f}_n )^2}{2\psi _i} \right) \right\} \\&\cdot \ (2\pi )^{-m/2}\exp \left( - \frac{\Vert \varvec{f}_n \Vert ^2}{2} \right) \end{aligned}$$

Then, the expectation of \({l}_{\rho }^{C}({\varvec{\varLambda }},\varvec{\varPsi }) \) is taken with respect to the posterior distribution \(f(\varvec{f}_n | \varvec{x}_n,{\varvec{\varLambda }},\varvec{\varPsi })\),

$$\begin{aligned}&E[{l}_{\rho }^{C} ({\varvec{\varLambda }},\varvec{\varPsi })]= \\&-\frac{N(p+m)}{2} \log (2\pi ) - \frac{N}{2} \sum _{i=1}^p \log \psi _i \\&- \frac{1}{2} \sum _{n=1}^N\sum _{i=1}^p \frac{x_{ni}^2 - 2x_{ni}\varvec{\lambda }_i^TE[\varvec{F}_n|\varvec{x}_n]+ \varvec{\lambda }_i^T E[\varvec{F}_n\varvec{F}_n^T|\varvec{x}_n]\varvec{\lambda }_i}{\psi _i} \\&- \frac{1}{2} \hbox { tr} \left\{ \sum _{n=1}^N E[\varvec{F}_n\varvec{F}_n^T|\varvec{x}_n] \right\} - N \sum _{i=1}^p\sum _{j=1}^m \rho P(|\lambda _{ij}|) \end{aligned}$$

For given \({\varvec{\varLambda }}_\text {old}\) and \(\varvec{\varPsi }_\text {old}\), the posterior \(f(\varvec{f}_n | \varvec{x}_n,{\varvec{\varLambda }}_\text {old}, \varvec{\varPsi }_\text {old})\) is normally distributed with \(E[\varvec{F}_n|\varvec{x}_n] = \varvec{M}^{-1}{\varvec{\varLambda }}_\text {old}^T\varvec{\varPsi }_\text {old}^{-1} \varvec{x}_n\) and \(E[\varvec{F}_n\varvec{F}_n^T|\varvec{x}_n] = \varvec{M} ^{-1} + E[\varvec{F}_n|\varvec{x}_n] E[\varvec{F}_n|\varvec{x}_n] ^T\), where \(\varvec{M} = {\varvec{\varLambda }}_\text {old}^T\varvec{\varPsi }_\text {old}^{-1}{\varvec{\varLambda }}_\text {old} + \varvec{I}_m\). Then, we have

$$\begin{aligned} \sum _{n=1}^N E[\varvec{F}_n]x_{ni}&= N\varvec{M}^{-1}{\varvec{\varLambda }}_\text {old}^T\varvec{\varPsi }_\text {old}^{-1}\varvec{s}_i,\\ \sum _{n=1}^N E[\varvec{F}_n\varvec{F}_n^T]&= N (\varvec{M} ^{-1} + \varvec{M}^{-1}{\varvec{\varLambda }}_\text {old}^T\varvec{\varPsi }_\text {old}^{-1}\varvec{S}\varvec{\varPsi }_\text {old}^{-1}{\varvec{\varLambda }}_\text {old}\varvec{M}^{-1}). \end{aligned}$$

Writing \(\varvec{b}_i = \varvec{M}^{-1}{\varvec{\varLambda }}_\text {old}^T\varvec{\varPsi }_\text {old}^{-1}\varvec{s}_i\) and \(\varvec{A} = \varvec{M} ^{-1} + \varvec{M}^{-1}{\varvec{\varLambda }}_\text {old}^T\varvec{\varPsi }_\text {old}^{-1}\varvec{S}\varvec{\varPsi }_\text {old}^{-1}{\varvec{\varLambda }}_\text {old}\varvec{M}^{-1}\), we obtain the expectation of \({l}_{\rho }^{C}({\varvec{\varLambda }},\varvec{\varPsi }) \) given in (7).
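The E-step quantities above can be sketched numerically. The following is a minimal illustration, not the authors' implementation: the current estimates \(\varvec{\varLambda }_\text {old}\), \(\varvec{\varPsi }_\text {old}\) and the data are generated randomly as placeholders, and the matrix `B` stacks the vectors \(\varvec{b}_i\) as its columns.

```python
import numpy as np

rng = np.random.default_rng(0)
p, m, N = 6, 2, 100  # hypothetical dimensions: variables, factors, observations

# Hypothetical current estimates (Lambda_old, Psi_old) and centered data
Lambda_old = rng.normal(size=(p, m))
Psi_old = np.diag(rng.uniform(0.5, 1.5, size=p))
X = rng.normal(size=(N, p))
S = X.T @ X / N  # sample covariance matrix

Psi_inv = np.linalg.inv(Psi_old)

# M = Lambda_old^T Psi_old^{-1} Lambda_old + I_m
M = Lambda_old.T @ Psi_inv @ Lambda_old + np.eye(m)
M_inv = np.linalg.inv(M)

# b_i = M^{-1} Lambda_old^T Psi_old^{-1} s_i; column i of B is b_i
B = M_inv @ Lambda_old.T @ Psi_inv @ S

# A = M^{-1} + M^{-1} Lambda_old^T Psi_old^{-1} S Psi_old^{-1} Lambda_old M^{-1}
A = M_inv + M_inv @ Lambda_old.T @ Psi_inv @ S @ Psi_inv @ Lambda_old @ M_inv
```

Since \(\varvec{S}\) and \(\varvec{M}^{-1}\) are symmetric, \(\varvec{A}\) is symmetric by construction, which provides a quick sanity check on the computation.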

Appendix B: Proof of Lemma 1

The proof is by contradiction. Suppose that \(\hat{{\varvec{\varLambda }}}\) and \(\hat{\varvec{\varPsi }}\) solve (6) and that the \(j\)th column of \(\hat{{\varvec{\varLambda }}}\) has exactly one nonzero element, say \(\hat{\lambda }_{aj}\). Define another pair of parameters \(\hat{{\varvec{\varLambda }}}^*\) and \(\hat{\varvec{\varPsi }}^*\), where \(\hat{{\varvec{\varLambda }}}^*\) equals \(\hat{{\varvec{\varLambda }}}\) except that its \((a,j)\)th element is zero, and \(\hat{\varvec{\varPsi }}^*\) equals \(\hat{\varvec{\varPsi }}\) except that its \(a\)th diagonal element is \(\hat{\psi }_a+\hat{\lambda }_{aj}^2\). The two parameter sets yield the same covariance structure, \(\hat{{\varvec{\varLambda }}}\hat{{\varvec{\varLambda }}}^T+\hat{\varvec{\varPsi }}=\hat{{\varvec{\varLambda }}}^*\hat{{\varvec{\varLambda }}}^{*T}+\hat{\varvec{\varPsi }}^*\), so that \(\ell (\hat{{\varvec{\varLambda }}}, \hat{\varvec{\varPsi }}) = \ell (\hat{{\varvec{\varLambda }}}^*, \hat{\varvec{\varPsi }}^*)\), whereas the penalty term \( \sum _{i=1}^p\sum _{j=1}^m \rho P(|\hat{\lambda }_{ij}|)\) is strictly larger than \(\sum _{i=1}^p\sum _{j=1}^m \rho P(|\hat{\lambda }^*_{ij}|)\). Hence \(\ell _{\rho }(\hat{{\varvec{\varLambda }}}, \hat{\varvec{\varPsi }}) < \ell _{\rho }(\hat{{\varvec{\varLambda }}}^*, \hat{\varvec{\varPsi }}^*)\), which contradicts the assumption that \(\hat{{\varvec{\varLambda }}}\) and \(\hat{\varvec{\varPsi }}\) are the penalized maximum likelihood estimates.
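The covariance identity at the heart of the proof is easy to verify numerically. The sketch below uses hypothetical sizes and randomly generated parameters: it builds a loading matrix whose second column has a single nonzero element, zeroes that element, absorbs its square into the corresponding uniqueness, and confirms that \({\varvec{\varLambda }}{\varvec{\varLambda }}^T+\varvec{\varPsi }\) is unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
p, m = 5, 2  # hypothetical numbers of variables and factors

# Loadings with a single nonzero element in column j (0-based indices)
Lam = rng.normal(size=(p, m))
a, j = 0, 1
Lam[:, j] = 0.0
Lam[a, j] = rng.normal()  # the lone nonzero element lambda_{aj}
Psi = np.diag(rng.uniform(0.5, 1.5, size=p))

# Modified parameters: zero out lambda_{aj}, inflate the a-th uniqueness
Lam_star = Lam.copy()
Lam_star[a, j] = 0.0
Psi_star = Psi.copy()
Psi_star[a, a] += Lam[a, j] ** 2

# Both parameter sets produce the same covariance structure
Sigma = Lam @ Lam.T + Psi
Sigma_star = Lam_star @ Lam_star.T + Psi_star
print(np.allclose(Sigma, Sigma_star))  # True
```

Because column \(j\) of \({\varvec{\varLambda }}\) has only the one nonzero entry, zeroing \(\lambda _{aj}\) changes \({\varvec{\varLambda }}{\varvec{\varLambda }}^T\) only in its \((a,a)\) entry, by exactly \(\lambda _{aj}^2\), which is why the single diagonal correction to \(\varvec{\varPsi }\) suffices.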


Cite this article

Hirose, K., Yamamoto, M. Sparse estimation via nonconcave penalized likelihood in factor analysis model. Stat Comput 25, 863–875 (2015). https://doi.org/10.1007/s11222-014-9458-0


Keywords

  • Coordinate descent algorithm
  • Factor analysis
  • Nonconvex penalty
  • Penalized likelihood
  • Rotation technique