Sparse estimation via nonconcave penalized likelihood in factor analysis model

Abstract

We consider the problem of sparse estimation in a factor analysis model. A traditional estimation procedure is the following two-step approach: the model is first estimated by the maximum likelihood method, and a rotation technique is then applied to find sparse factor loadings. However, the maximum likelihood estimates cannot be obtained when the number of variables is much larger than the number of observations. Furthermore, even when the maximum likelihood estimates are available, the rotation technique often fails to produce a sufficiently sparse solution. To handle these problems, this paper introduces a penalized likelihood procedure that imposes a nonconvex penalty on the factor loadings. We show that the penalized likelihood procedure can be viewed as a generalization of the traditional two-step approach and that the proposed methodology can produce sparser solutions than the rotation technique. A new algorithm based on the EM algorithm and coordinate descent is introduced to compute the entire solution path, which permits the application of a wide variety of convex and nonconvex penalties. Monte Carlo simulations are conducted to investigate the performance of our modeling strategy. A real data example is also given to illustrate our procedure.
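In the notation of the appendices, the criterion maximized by the procedure can be written as the penalized log-likelihood (this display is reconstructed from Appendices A and B for orientation; the precise form is given in the main text)

$$\begin{aligned} \ell _{\rho }({\varvec{\varLambda }},\varvec{\varPsi }) = \ell ({\varvec{\varLambda }},\varvec{\varPsi }) - N \sum _{i=1}^p\sum _{j=1}^m \rho P(|\lambda _{ij}|), \end{aligned}$$

where \(\ell ({\varvec{\varLambda }},\varvec{\varPsi })\) is the observed-data log-likelihood of the factor analysis model with \(p\times m\) loading matrix \({\varvec{\varLambda }}=(\lambda _{ij})\) and diagonal unique-variance matrix \(\varvec{\varPsi }\), \(\rho >0\) is a regularization parameter, and \(P(\cdot )\) is a convex or nonconvex penalty function.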




Acknowledgments

The authors would like to thank the anonymous reviewers for their constructive and helpful comments, which improved the quality of the paper considerably. We also thank Professor Yutaka Kano for helpful discussions.

Author information


Corresponding author

Correspondence to Kei Hirose.

Appendices

Appendix A: Derivation of complete-data penalized log-likelihood function in EM algorithm

To apply the EM algorithm, the common factors \(\varvec{f}_n\) are first regarded as missing data, and we maximize the complete-data penalized log-likelihood function

$$\begin{aligned} {l}_{\rho }^{C} ({\varvec{\varLambda }},\varvec{\varPsi }) = \sum _{n=1}^N \log f(\varvec{x}_n,\varvec{f}_n) - N \sum _{i=1}^p\sum _{j=1}^m \rho P(|\lambda _{ij}|), \end{aligned}$$

where the density function \(f(\varvec{x}_n,\varvec{f}_n)\) is defined by

$$\begin{aligned} f(\varvec{x}_n,\varvec{f}_n)&= \prod _{i=1}^p \left\{ (2\pi \psi _i)^{-1/2} \exp \left( - \frac{ (x_{ni}-\varvec{\lambda }_i^T\varvec{f}_n )^2}{2\psi _i} \right) \right\} \\&\cdot \ (2\pi )^{-m/2}\exp \left( - \frac{\Vert \varvec{f}_n \Vert ^2}{2} \right) \end{aligned}$$

Then, the expectation of \({l}_{\rho }^{C}({\varvec{\varLambda }},\varvec{\varPsi }) \) can be taken with respect to the distributions \(f(\varvec{f}_n | \varvec{x}_n,{\varvec{\varLambda }},\varvec{\varPsi })\),

$$\begin{aligned}&E[{l}_{\rho }^{C} ({\varvec{\varLambda }},\varvec{\varPsi })]= \\&-\frac{N(p+m)}{2} \log (2\pi ) - \frac{N}{2} \sum _{i=1}^p \log \psi _i \\&- \frac{1}{2} \sum _{n=1}^N\sum _{i=1}^p \frac{x_{ni}^2 - 2x_{ni}\varvec{\lambda }_i^TE[\varvec{F}_n|\varvec{x}_n]+ \varvec{\lambda }_i^T E[\varvec{F}_n\varvec{F}_n^T|\varvec{x}_n]\varvec{\lambda }_i}{\psi _i} \\&- \frac{1}{2} \hbox { tr} \left\{ \sum _{n=1}^N E[\varvec{F}_n\varvec{F}_n^T|\varvec{x}_n] \right\} - N \sum _{i=1}^p\sum _{j=1}^m \rho P(|\lambda _{ij}|) \end{aligned}$$

For given \({\varvec{\varLambda }}_\text {old}\) and \(\varvec{\varPsi }_\text {old}\), the posterior \(f(\varvec{f}_n | \varvec{x}_n,{\varvec{\varLambda }}_\text {old}, \varvec{\varPsi }_\text {old})\) is normally distributed with \(E[\varvec{F}_n|\varvec{x}_n] = \varvec{M}^{-1}{\varvec{\varLambda }}_\text {old}^T\varvec{\varPsi }_\text {old}^{-1} \varvec{x}_n\) and \(E[\varvec{F}_n\varvec{F}_n^T|\varvec{x}_n] = \varvec{M} ^{-1} + E[\varvec{F}_n|\varvec{x}_n] E[\varvec{F}_n|\varvec{x}_n] ^T\), where \(\varvec{M} = {\varvec{\varLambda }}_\text {old}^T\varvec{\varPsi }_\text {old}^{-1}{\varvec{\varLambda }}_\text {old} + \varvec{I}_m\). Then, we have

$$\begin{aligned} \sum _{n=1}^N E[\varvec{F}_n]x_{ni}&= N\varvec{M}^{-1}{\varvec{\varLambda }}_\text {old}^T\varvec{\varPsi }_\text {old}^{-1}\varvec{s}_i,\\ \sum _{n=1}^N E[\varvec{F}_n\varvec{F}_n^T]&= N (\varvec{M} ^{-1} + \varvec{M}^{-1}{\varvec{\varLambda }}_\text {old}^T\varvec{\varPsi }_\text {old}^{-1}\varvec{S}\varvec{\varPsi }_\text {old}^{-1}{\varvec{\varLambda }}_\text {old}\varvec{M}^{-1}). \end{aligned}$$

Writing \(\varvec{b}_i = \varvec{M}^{-1}{\varvec{\varLambda }}_\text {old}^T\varvec{\varPsi }_\text {old}^{-1}\varvec{s}_i\) and \(\varvec{A} = \varvec{M} ^{-1} + \varvec{M}^{-1}{\varvec{\varLambda }}_\text {old}^T\varvec{\varPsi }_\text {old}^{-1}\varvec{S}\varvec{\varPsi }_\text {old}^{-1}{\varvec{\varLambda }}_\text {old}\varvec{M}^{-1}\), we obtain the expectation of \({l}_{\rho }^{C}({\varvec{\varLambda }},\varvec{\varPsi })\) given in (7).
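As a computational illustration, the following is a minimal NumPy sketch of these E-step quantities. It is a reconstruction under the notation above (sample covariance matrix \(\varvec{S}\) with columns \(\varvec{s}_i\)), not the authors' implementation; the function name and argument layout are illustrative.

```python
import numpy as np

def e_step(Lambda_old, psi_old, S):
    """E-step quantities from Appendix A (illustrative sketch).

    Lambda_old : (p, m) current loading matrix
    psi_old    : (p,)   current unique variances (diagonal of Psi)
    S          : (p, p) sample covariance matrix
    Returns B of shape (p, m), whose i-th row is b_i = M^{-1} Lambda^T Psi^{-1} s_i,
    and A = M^{-1} + M^{-1} Lambda^T Psi^{-1} S Psi^{-1} Lambda M^{-1}.
    """
    m = Lambda_old.shape[1]
    Psi_inv = np.diag(1.0 / psi_old)
    # M = Lambda_old^T Psi_old^{-1} Lambda_old + I_m
    M = Lambda_old.T @ Psi_inv @ Lambda_old + np.eye(m)
    M_inv = np.linalg.inv(M)
    C = M_inv @ Lambda_old.T @ Psi_inv    # (m, p): M^{-1} Lambda^T Psi^{-1}
    B = (C @ S).T                         # row i equals b_i = C s_i
    A = M_inv + C @ S @ C.T               # posterior second-moment term
    return B, A
```

With these quantities, the sums in the display above are \(N\varvec{b}_i\) and \(N\varvec{A}\), respectively.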

Appendix B: Proof of Lemma 1

The proof is by contradiction. Assume that \(\hat{{\varvec{\varLambda }}}\) and \(\hat{\varvec{\varPsi }}\) are a solution of (6) and that the \(j\)th column of \(\hat{{\varvec{\varLambda }}}\) has only one nonzero element, say \(\hat{\lambda }_{aj}\). Define another pair of parameters \(\hat{{\varvec{\varLambda }}}^*\) and \(\hat{\varvec{\varPsi }}^*\), where \(\hat{{\varvec{\varLambda }}}^*\) is the same as \(\hat{{\varvec{\varLambda }}}\) but with the \((a,j)\)th element set to zero, and \(\hat{\varvec{\varPsi }}^*\) is the same as \(\hat{\varvec{\varPsi }}\) but with the \(a\)th diagonal element replaced by \(\hat{\psi }_a+\hat{\lambda }_{aj}^2\). The two parameter values yield the same covariance structure, i.e., \(\hat{{\varvec{\varLambda }}}\hat{{\varvec{\varLambda }}}^T+\hat{\varvec{\varPsi }}=\hat{{\varvec{\varLambda }}}^*\hat{{\varvec{\varLambda }}}^{*T}+\hat{\varvec{\varPsi }}^*\), which implies \(\ell (\hat{{\varvec{\varLambda }}}, \hat{\varvec{\varPsi }}) = \ell (\hat{{\varvec{\varLambda }}}^*, \hat{\varvec{\varPsi }}^*)\), whereas the penalty term \( \sum _{i=1}^p\sum _{j=1}^m \rho P(|\hat{\lambda }_{ij}|)\) is larger than \(\sum _{i=1}^p\sum _{j=1}^m \rho P(|\hat{\lambda }^*_{ij}|)\) because \(\hat{\lambda }_{aj}\ne 0\). It follows that \(\ell _{\rho }(\hat{{\varvec{\varLambda }}}, \hat{\varvec{\varPsi }}) < \ell _{\rho }(\hat{{\varvec{\varLambda }}}^*, \hat{\varvec{\varPsi }}^*)\), which contradicts the assumption that \(\hat{{\varvec{\varLambda }}}\) and \(\hat{\varvec{\varPsi }}\) are the penalized maximum likelihood estimates.
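The covariance-preserving step in this argument can also be checked numerically; the snippet below is only an illustrative sanity check with arbitrary made-up numbers, not part of the original proof.

```python
import numpy as np

# If column j of Lambda has a single nonzero entry lambda_{aj}, zeroing it and
# adding lambda_{aj}^2 to psi_a leaves Lambda Lambda^T + Psi unchanged.
Lambda = np.array([[0.8, 0.0],
                   [0.7, 0.0],
                   [0.6, 0.5]])   # column 2 has its only nonzero in row 3
psi = np.array([0.3, 0.4, 0.2])
a, j = 2, 1                        # zero-based indices of that nonzero entry

Lambda_star = Lambda.copy()
Lambda_star[a, j] = 0.0
psi_star = psi.copy()
psi_star[a] += Lambda[a, j] ** 2

Sigma = Lambda @ Lambda.T + np.diag(psi)
Sigma_star = Lambda_star @ Lambda_star.T + np.diag(psi_star)
assert np.allclose(Sigma, Sigma_star)   # identical covariance structure
```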


Cite this article

Hirose, K., Yamamoto, M. Sparse estimation via nonconcave penalized likelihood in factor analysis model. Stat Comput 25, 863–875 (2015). https://doi.org/10.1007/s11222-014-9458-0


Keywords

  • Coordinate descent algorithm
  • Factor analysis
  • Nonconvex penalty
  • Penalized likelihood
  • Rotation technique