Sparse estimation via nonconcave penalized likelihood in factor analysis model


Abstract

We consider the problem of sparse estimation in a factor analysis model. A traditional estimation procedure is the following two-step approach: the model is estimated by the maximum likelihood method, and then a rotation technique is used to find sparse factor loadings. However, the maximum likelihood estimates cannot be obtained when the number of variables is much larger than the number of observations. Furthermore, even when the maximum likelihood estimates are available, the rotation technique often does not produce a sufficiently sparse solution. To handle these problems, this paper introduces a penalized likelihood procedure that imposes a nonconvex penalty on the factor loadings. We show that the penalized likelihood procedure can be viewed as a generalization of the traditional two-step approach and that the proposed methodology can produce sparser solutions than the rotation technique. A new algorithm based on the EM algorithm combined with coordinate descent is introduced to compute the entire solution path, which permits the use of a wide variety of convex and nonconvex penalties. Monte Carlo simulations are conducted to investigate the performance of our modeling strategy. A real data example is also given to illustrate our procedure.




Acknowledgments

The authors would like to thank the anonymous reviewers for their constructive and helpful comments, which improved the quality of the paper considerably. We also thank Professor Yutaka Kano for helpful discussions.

Author information


Corresponding author

Correspondence to Kei Hirose.

Appendices

Appendix A: Derivation of complete-data penalized log-likelihood function in EM algorithm

To apply the EM algorithm, the common factors \(\varvec{f}_n\) are first regarded as missing data, and we maximize the complete-data penalized log-likelihood function

$$\begin{aligned} {l}_{\rho }^{C} ({\varvec{\varLambda }},\varvec{\varPsi }) = \sum _{n=1}^N \log f(\varvec{x}_n,\varvec{f}_n) - N \sum _{i=1}^p\sum _{j=1}^m \rho P(|\lambda _{ij}|), \end{aligned}$$

where the density function \(f(\varvec{x}_n,\varvec{f}_n)\) is defined by

$$\begin{aligned} f(\varvec{x}_n,\varvec{f}_n)&= \prod _{i=1}^p \left\{ (2\pi \psi _i)^{-1/2} \exp \left( - \frac{ (x_{ni}-\varvec{\lambda }_i^T\varvec{f}_n )^2}{2\psi _i} \right) \right\} \\&\cdot \ (2\pi )^{-m/2}\exp \left( - \frac{\Vert \varvec{f}_n \Vert ^2}{2} \right) \end{aligned}$$

We then take the expectation of \({l}_{\rho }^{C} ({\varvec{\varLambda }},\varvec{\varPsi })\) with respect to the posterior distributions \(f(\varvec{f}_n | \varvec{x}_n,{\varvec{\varLambda }},\varvec{\varPsi })\):

$$\begin{aligned}&E[{l}_{\rho }^{C} ({\varvec{\varLambda }},\varvec{\varPsi })]= \\&-\frac{N(p+m)}{2} \log (2\pi ) - \frac{N}{2} \sum _{i=1}^p \log \psi _i \\&- \frac{1}{2} \sum _{n=1}^N\sum _{i=1}^p \frac{x_{ni}^2 - 2x_{ni}\varvec{\lambda }_i^TE[\varvec{F}_n|\varvec{x}_n]+ \varvec{\lambda }_i^T E[\varvec{F}_n\varvec{F}_n^T|\varvec{x}_n]\varvec{\lambda }_i}{\psi _i} \\&- \frac{1}{2} \hbox { tr} \left\{ \sum _{n=1}^N E[\varvec{F}_n\varvec{F}_n^T|\varvec{x}_n] \right\} - N \sum _{i=1}^p\sum _{j=1}^m \rho P(|\lambda _{ij}|) \end{aligned}$$

For given \({\varvec{\varLambda }}_\text {old}\) and \(\varvec{\varPsi }_\text {old}\), the posterior \(f(\varvec{f}_n | \varvec{x}_n,{\varvec{\varLambda }}_\text {old}, \varvec{\varPsi }_\text {old})\) is normal, with conditional moments \(E[\varvec{F}_n|\varvec{x}_n] = \varvec{M}^{-1}{\varvec{\varLambda }}_\text {old}^T\varvec{\varPsi }_\text {old}^{-1} \varvec{x}_n\) and \(E[\varvec{F}_n\varvec{F}_n^T|\varvec{x}_n] = \varvec{M} ^{-1} + E[\varvec{F}_n|\varvec{x}_n] E[\varvec{F}_n|\varvec{x}_n]^T\), where \(\varvec{M} = {\varvec{\varLambda }}_\text {old}^T\varvec{\varPsi }_\text {old}^{-1}{\varvec{\varLambda }}_\text {old} + \varvec{I}_m\). Then, we have

$$\begin{aligned} \sum _{n=1}^N E[\varvec{F}_n]x_{ni}&= N\varvec{M}^{-1}{\varvec{\varLambda }}_\text {old}^T\varvec{\varPsi }_\text {old}^{-1}\varvec{s}_i,\\ \sum _{n=1}^N E[\varvec{F}_n\varvec{F}_n^T]&= N (\varvec{M} ^{-1} + \varvec{M}^{-1}{\varvec{\varLambda }}_\text {old}^T\varvec{\varPsi }_\text {old}^{-1}\varvec{S}\varvec{\varPsi }_\text {old}^{-1}{\varvec{\varLambda }}_\text {old}\varvec{M}^{-1}). \end{aligned}$$

Let \(\varvec{M}^{-1}{\varvec{\varLambda }}_\text {old}^T\varvec{\varPsi }_\text {old}^{-1}\varvec{s}_i\) and \(\varvec{M} ^{-1} + \varvec{M}^{-1}{\varvec{\varLambda }}_\text {old}^T\varvec{\varPsi }_\text {old}^{-1}\varvec{S}\varvec{\varPsi }_\text {old}^{-1}{\varvec{\varLambda }}_\text {old}\varvec{M}^{-1}\) be \(\varvec{b}_i\) and \(\varvec{A}\), respectively. Then, the expectation of \({l}_{\rho }^{C}({\varvec{\varLambda }},\varvec{\varPsi }) \) in (7) can be derived.
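
The quantities \(\varvec{b}_i\) and \(\varvec{A}\) are straightforward to compute from \({\varvec{\varLambda }}_\text {old}\), \(\varvec{\varPsi }_\text {old}\), and the sample covariance matrix \(\varvec{S}\). The following is a minimal NumPy sketch of this E-step computation; it is our own illustration rather than the authors' implementation, and the function name `e_step` and the convention that `Psi_old` holds only the diagonal of \(\varvec{\varPsi }_\text {old}\) are assumptions made here.

```python
import numpy as np

def e_step(Lambda_old, Psi_old, S):
    """E-step quantities for the penalized EM algorithm (illustrative sketch).

    Lambda_old : (p, m) current factor loadings
    Psi_old    : (p,)   current unique variances (diagonal of Psi_old)
    S          : (p, p) sample covariance matrix

    Returns B, whose i-th row is b_i = M^{-1} Lambda_old^T Psi_old^{-1} s_i, and
    A = M^{-1} + M^{-1} Lambda_old^T Psi_old^{-1} S Psi_old^{-1} Lambda_old M^{-1}.
    """
    m = Lambda_old.shape[1]
    Psi_inv = np.diag(1.0 / Psi_old)                      # Psi_old^{-1}
    M = Lambda_old.T @ Psi_inv @ Lambda_old + np.eye(m)   # M = Lambda^T Psi^{-1} Lambda + I_m
    W = np.linalg.solve(M, Lambda_old.T @ Psi_inv)        # W = M^{-1} Lambda^T Psi^{-1}
    B = (W @ S).T                                         # i-th row of B is b_i = W s_i
    A = np.linalg.inv(M) + W @ S @ W.T                    # A = M^{-1} + W S W^T
    return B, A
```

Given \(\varvec{b}_i\) and \(\varvec{A}\), the M-step update of the loadings and unique variances is carried out by coordinate descent on the expected penalized log-likelihood in (7), as described in the main text; it is not sketched here.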

Appendix B: Proof of Lemma 1

The proof is by contradiction. Assume that \(\hat{{\varvec{\varLambda }}}\) and \(\hat{\varvec{\varPsi }}\) are the solution of (6) and that the \(j\)th column of \(\hat{{\varvec{\varLambda }}}\) has only one nonzero element, say \(\hat{\lambda }_{aj}\). Define another pair of parameters \(\hat{{\varvec{\varLambda }}}^*\) and \(\hat{\varvec{\varPsi }}^*\), where \(\hat{{\varvec{\varLambda }}}^*\) is the same as \(\hat{{\varvec{\varLambda }}}\) but with the \((a,j)\)th element set to zero, and \(\hat{\varvec{\varPsi }}^*\) is the same as \(\hat{\varvec{\varPsi }}\) but with the \(a\)th diagonal element being \(\hat{\psi }_a+\hat{\lambda }_{aj}^2\). In this case, the two parameter values yield the same covariance structure, i.e., \(\hat{{\varvec{\varLambda }}}\hat{{\varvec{\varLambda }}}^T+\hat{\varvec{\varPsi }}=\hat{{\varvec{\varLambda }}}^*\hat{{\varvec{\varLambda }}}^{*T}+\hat{\varvec{\varPsi }}^*\), which implies \(\ell (\hat{{\varvec{\varLambda }}}, \hat{\varvec{\varPsi }}) = \ell (\hat{{\varvec{\varLambda }}}^*, \hat{\varvec{\varPsi }}^*)\), whereas the penalty term \( \sum _{i=1}^p\sum _{j=1}^m \rho P(|\hat{\lambda }_{ij}|)\) is larger than \(\sum _{i=1}^p\sum _{j=1}^m \rho P(|\hat{\lambda }^*_{ij}|)\). This means \(\ell _{\rho }(\hat{{\varvec{\varLambda }}}, \hat{\varvec{\varPsi }}) < \ell _{\rho }(\hat{{\varvec{\varLambda }}}^*, \hat{\varvec{\varPsi }}^*)\), which contradicts the assumption that \(\hat{{\varvec{\varLambda }}}\) and \(\hat{\varvec{\varPsi }}\) are the penalized maximum likelihood estimates.
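
The covariance equivalence invoked above is easy to check numerically. The short script below is our own illustration (with arbitrary dimensions, a random loading matrix, and hypothetical indices \(a\) and \(j\)): zeroing the lone nonzero element \(\hat{\lambda }_{aj}\) of the \(j\)th column and adding \(\hat{\lambda }_{aj}^2\) to the \(a\)th unique variance leaves \({\varvec{\varLambda }}{\varvec{\varLambda }}^T+\varvec{\varPsi }\) unchanged, so the likelihood is unchanged while the penalty strictly decreases.

```python
import numpy as np

rng = np.random.default_rng(0)
p, m = 6, 3
Lambda = rng.normal(size=(p, m))
Psi = np.diag(rng.uniform(0.5, 1.5, size=p))   # diagonal unique variances

# Make column j have exactly one nonzero element, located in row a.
a, j = 2, 1
Lambda[:, j] = 0.0
Lambda[a, j] = 0.8

# Modified parameters: zero out lambda_{aj} and absorb its square into psi_a.
Lambda_star = Lambda.copy()
Lambda_star[a, j] = 0.0
Psi_star = Psi.copy()
Psi_star[a, a] += Lambda[a, j] ** 2

Sigma = Lambda @ Lambda.T + Psi
Sigma_star = Lambda_star @ Lambda_star.T + Psi_star
print(np.allclose(Sigma, Sigma_star))          # True: identical covariance structure
```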


Cite this article

Hirose, K., Yamamoto, M. Sparse estimation via nonconcave penalized likelihood in factor analysis model. Stat Comput 25, 863–875 (2015). https://doi.org/10.1007/s11222-014-9458-0
