
Computation of the maximum likelihood estimator in low-rank factor analysis

  • Full Length Paper
  • Series B
  • Mathematical Programming

Abstract

Factor analysis is a classical multivariate dimensionality reduction technique popularly used in statistics, econometrics and data science. Estimation for factor analysis is often carried out via the maximum likelihood principle, which seeks to maximize the Gaussian likelihood under the assumption that the positive definite covariance matrix can be decomposed as the sum of a low-rank positive semidefinite matrix and a diagonal matrix with nonnegative entries. This leads to a challenging rank constrained nonconvex optimization problem, for which very few reliable computational algorithms are available. We reformulate the low-rank maximum likelihood factor analysis task as a nonlinear nonsmooth semidefinite optimization problem, study various structural properties of this reformulation; and propose fast and scalable algorithms based on difference of convex optimization. Our approach has computational guarantees, gracefully scales to large problems, is applicable to situations where the sample covariance matrix is rank deficient and adapts to variants of the maximum likelihood problem with additional constraints on the model parameters. Our numerical experiments validate the usefulness of our approach over existing state-of-the-art approaches for maximum likelihood factor analysis.
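As a concrete illustration of the estimation problem described above, the following sketch (added here for illustration; NumPy, the variable names, and the random data are assumptions, and this is not the paper's algorithm) evaluates the Gaussian negative log-likelihood, up to additive constants, at a candidate decomposition \(\varvec{\Sigma } = \mathbf {L}\mathbf {L}^\top + \varvec{\Psi }\) with \(\varvec{\Psi }\) diagonal:

    import numpy as np

    def neg_log_likelihood(L, psi, S):
        """Gaussian negative log-likelihood (up to constants) at Sigma = L L^T + diag(psi).

        L   : (p, r) loading matrix (low-rank part), r < p
        psi : (p,) nonnegative unique variances (diagonal part)
        S   : (p, p) sample covariance matrix
        """
        Sigma = L @ L.T + np.diag(psi)          # positive definite whenever psi > 0
        sign, logdet = np.linalg.slogdet(Sigma)
        return logdet + np.trace(np.linalg.solve(Sigma, S))

    # Illustrative usage with random data (p = 10 variables, r = 2 factors).
    rng = np.random.default_rng(0)
    p, r, n = 10, 2, 200
    X = rng.standard_normal((n, p))
    S = np.cov(X, rowvar=False)
    print(neg_log_likelihood(rng.standard_normal((p, r)), np.ones(p), S))

Maximum likelihood factor analysis seeks the rank-constrained pair \((\mathbf {L}, \varvec{\Psi })\) minimizing this objective; the difficulty discussed in the paper lies in the nonconvexity and the rank constraint, not in evaluating the objective.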


Notes

  1. Indeed, \(\varvec{\Psi }\succeq \epsilon \mathbf {I}\) implies that \(\varvec{\Sigma }\succeq \epsilon \mathbf {I} \succ \mathbf {0}\). Thus, \(-\log \det (\varvec{\Sigma }^{-1}) \ge p\log (\epsilon )\) and \( \mathrm {tr}\left( \varvec{\Sigma }^{-1} \mathbf {S}\right) \ge 0\), which shows that Problem (2) is bounded below. Note that Problem (2) with \(\epsilon =0\) need not have a finite solution, i.e., the ML solution need not exist. If \(\psi _i \rightarrow \infty \) for some i, then \({{\mathcal {L}}}(\varvec{\Sigma }) \rightarrow \infty \); a similar argument applies if \(\mathbf {L}\mathbf {L}^\top \) becomes unbounded. Thus the infimum of Problem (2) is attained when \(\epsilon >0\).

  2. We observed this in our experiments; the results are reported in the section on numerical experiments.

  3. A function \(g(y_{1}, \ldots , y_{p}) : \mathfrak {R}^{p} \rightarrow \mathfrak {R}\) is said to be symmetric in its arguments if, for any permutation \(\pi \) of the indices \(\{1, \ldots , p \}\), we have \(g(y_{1}, \ldots , y_{p}) = g(y_{\pi (1)}, \ldots , y_{\pi (p)})\).

  4. We note that the objective values are non-increasing, i.e., \(f(\varvec{\phi }^{(k+1)}) \le f(\varvec{\phi }^{(k)})\) for all k (see Proposition 7).

  5. We call \(f_{2}(\varvec{\phi })\) a spectral function as it is a function of the eigenvalues (or spectral values) \(\{\lambda _{i}^*\}_1^p\).

  6. We note that non-differentiability occurs if \(g(y_{(r+1)})=g(y_{(r)})\).

  7. This function is available as part of the PRML toolbox on the MATLAB Central File Exchange: https://www.mathworks.com/matlabcentral/fileexchange/55883-probabilistic-pca-and-factor-analysis?focused=6047050&tab=function.

  8. Available at: https://web.stanford.edu/~hastie/ElemStatLearn/datasets/phoneme.data.

  9. Available at https://web.stanford.edu/~hastie/ElemStatLearn/datasets/zip.test.gz.

  10. Data available at http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/14cancer.info.

  11. Note that additional assumptions may be needed to ensure a unique decomposition of \(\varvec{\Sigma }\) into its components \(\varvec{\Psi }\) and \(\varvec{\Theta }=\mathbf {L} \mathbf {L}^\top \). However, the optimization task is well defined even in the absence of such identifiability constraints.

  12. This encourages a conditional independence structure among the variables in \(\mathbf {u}\) (assuming \(\mathbf {u}\) follows a multivariate normal distribution).

  13. We declare convergence if the relative change in successive objective values is smaller than \(10^{-4}\).

References

  1. Ahn, M., Pang, J.-S., Xin, J.: Difference-of-convex learning: directional stationarity, optimality, and sparsity. SIAM J. Optim. 27(3), 1637–1665 (2017)

  2. Anderson, T.: An Introduction to Multivariate Statistical Analysis, 3rd edn. Wiley, New York (2003)

  3. Atchadé, Y.F., Mazumder, R., Chen, J.: Scalable computation of regularized precision matrices via stochastic optimization (2015). arXiv preprint arXiv:1509.00426

  4. Bai, J., Li, K.: Statistical analysis of factor models of high dimension. Ann. Stat. 40(1), 436–465 (2012)

  5. Bai, J., Ng, S.: Large dimensional factor analysis. Found. Trends Econom. 3(2), 89–163 (2008)

  6. Banerjee, O., El Ghaoui, L., d’Aspremont, A.: Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. J. Mach. Learn. Res. 9, 485–516 (2008)

  7. Bartholomew, D., Knott, M., Moustaki, I.: Latent Variable Models and Factor Analysis: A Unified Approach. Wiley, London (2011)

  8. Bertsimas, D., Copenhaver, M.S., Mazumder, R.: Certifiably optimal low rank factor analysis. J. Mach. Learn. Res. 18(29), 1–53 (2017)

  9. Bishop, C.: Pattern Recognition and Machine Learning. Springer, New York (2006)

  10. Borwein, J., Lewis, A.: Convex Analysis and Nonlinear Optimization. Springer, New York (2006)

  11. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)

  12. Brand, M.: Fast low-rank modifications of the thin singular value decomposition. Linear Algebra Appl. 415(1), 20–30 (2006)

  13. Davis, C.: All convex invariant functions of Hermitian matrices. Archiv der Mathematik 8(4), 276–278 (1957)

  14. Pham Dinh, T., Le Thi, H.A.: Recent advances in DC programming and DCA. In: Transactions on Computational Intelligence XIII, pp. 1–37. Springer, New York (2014)

  15. Friedman, J., Hastie, T., Tibshirani, R.: Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9, 432–441 (2007)

  16. Golub, G., Van Loan, C.: Matrix Computations, vol. 3. JHU Press, Baltimore (2012)

  17. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edn. Springer Series in Statistics. Springer, New York (2009)

  18. Hiriart-Urruty, J.-B.: Generalized differentiability/duality and optimization for problems dealing with differences of convex functions. In: Convexity and Duality in Optimization, pp. 37–70. Springer, New York (1985)

  19. Jöreskog, K.G.: Some contributions to maximum likelihood factor analysis. Psychometrika 32(4), 443–482 (1967)

  20. Larsen, R.M.: PROPACK: Software for large and sparse SVD calculations (2004). http://sun.stanford.edu/rmunk/PROPACK

  21. Lawley, D.N.: The estimation of factor loadings by the method of maximum likelihood. Proc. R. Soc. Edinb. 60(01), 64–82 (1940)

  22. Lawley, D.N.: Some new results in maximum likelihood factor analysis. Proc. R. Soc. Edinb. 67(01), 256–264 (1967)

  23. Lawley, D.N., Maxwell, A.E.: Factor Analysis as a Statistical Method, 2nd edn. Butterworth, London (1971)

  24. Lewis, A.: Derivatives of spectral functions. Math. Oper. Res. 21(3), 576–588 (1996)

  25. Lewis, A.S.: Convex analysis on the Hermitian matrices. SIAM J. Optim. 6, 164–177 (1996)

  26. Mardia, K., Kent, J., Bibby, J.: Multivariate Analysis. Academic Press, London (1979)

  27. Nouiehed, M., Pang, J.-S., Razaviyayn, M.: On the pervasiveness of difference-convexity in optimization and statistics (2017). arXiv preprint arXiv:1704.03535

  28. O’Donoghue, B., Chu, E., Parikh, N., Boyd, S.: Conic optimization via operator splitting and homogeneous self-dual embedding. J. Optim. Theory Appl. 169(3), 1042–1068 (2016)

  29. Pang, J.-S., Razaviyayn, M., Alvarado, A.: Computing B-stationary points of nonsmooth DC programs. Math. Oper. Res. 42(1), 95–118 (2016)

  30. Pham Dinh, T., Ngai, H.V., Le Thi, H.A.: Convergence analysis of the DC algorithm for DC programming with subanalytic data (2013) (preprint)

  31. Robertson, D., Symons, J.: Maximum likelihood factor analysis with rank-deficient sample covariance matrices. J. Multivar. Anal. 98(4), 813–828 (2007)

  32. Rockafellar, R.T., Wets, R.J.-B.: Variational Analysis, vol. 317. Springer, New York (2009)

  33. Rubin, D.B., Thayer, D.T.: EM algorithms for ML factor analysis. Psychometrika 47(1), 69–76 (1982)

  34. Saunderson, J., Chandrasekaran, V., Parrilo, P., Willsky, A.: Diagonal and low-rank matrix decompositions, correlation matrices, and ellipsoid fitting. SIAM J. Matrix Anal. Appl. 33(4), 1395–1416 (2012)

  35. Shapiro, A., Ten Berge, J.: Statistical inference of minimum rank factor analysis. Psychometrika 67, 79–94 (2002)

  36. Spearman, C.: “General Intelligence,” objectively determined and measured. Am. J. Psychol. 15, 201–293 (1904)

  37. Tuy, H.: DC optimization: theory, methods and algorithms. In: Handbook of Global Optimization, pp. 149–216. Springer, New York (1995)

  38. Vangeepuram, S., Lanckriet, G.R.G.: On the convergence of the concave-convex procedure. In: Advances in Neural Information Processing Systems (NIPS), vol. 22. MIT Press (2009)

  39. Yuille, A., Rangarajan, A.: The concave-convex procedure (CCCP). Neural Comput. 15, 915–936 (2003)


Acknowledgements

The authors would like to thank the Editors and the anonymous Reviewers for their helpful feedback and detailed comments, which helped improve this manuscript. Rahul Mazumder was partially supported by ONR-N000141512342, ONR-N000141812298 (YIP) and NSF-IIS1718258. The authors would also like to thank the Statistics Department at Columbia University for hosting Koulik Khamaru as a summer intern, when this work started.

Author information

Corresponding author

Correspondence to Rahul Mazumder.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Proposition 10

(See Sect. 6 in [10]) Suppose the function \(g: \mathbf {E} \mapsto (-\infty , \infty )\) is convex, where \(\mathbf {E} \subset \mathfrak {R}^{m}\), and the point x lies in the interior of dom(g). Let \(x^r,x \in \mathbf {E}\), and let \(\nu ^r\) be a subgradient of g evaluated at \(x^r\) for \(r\ge 1\). If \(x^{r} \rightarrow x\) and \(\nu ^{r} \rightarrow \nu \) as \(r \rightarrow \infty \), then \(\nu \) is a subgradient of g evaluated at x.
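A one-dimensional illustration of this closedness property (added here; not part of the original text): take \(g(x) = |x|\) on \(\mathbf {E} = \mathfrak {R}\), \(x^r = 1/r\) and \(\nu ^r = 1 \in \partial g(x^r)\). Then \(x^r \rightarrow 0\) and \(\nu ^r \rightarrow 1\), and indeed \(1 \in \partial g(0) = [-1, 1]\), consistent with the proposition.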

1.1 Proof of Proposition 9

Note that the objective function \(f(\varvec{\phi })\) (see (22)) is unbounded above as \(\phi _{i} \rightarrow 0\) for any \(i \in [p]\) (see also Proposition 3). Since the objective values \(f(\varvec{\phi }^{(k)})\) are non-increasing (Proposition 7), this implies that there exists an \(\alpha >0\) such that \(\varvec{\phi }^{(k)} \in [\alpha , \tfrac{1}{\epsilon }]^p\) for all k. The boundedness of the sequence \(\{\varvec{\phi }^{(k)}\}\) implies the existence of a limit point, say \(\varvec{\phi }^{*}\); let \(\varvec{\phi }^{(k_j)}\) be a subsequence such that \( \varvec{\phi }^{(k_j)} \rightarrow \varvec{\phi }^{*}\). Note that for every k, \(\varvec{\phi }^{(k+1)} \in {\hbox {arg min}}_{\varvec{\phi }\in {\mathcal {C}}} F(\varvec{\phi }; \varvec{\phi }^{(k)})\) is equivalent to

$$\begin{aligned} \left\langle \nabla f_{1}(\varvec{\phi }^{(k+1)}) - \partial f_{2} ( \varvec{\phi }^{(k)}), \varvec{\phi }- \varvec{\phi }^{(k+1)} \right\rangle \ge 0~~\forall \varvec{\phi }\in {{\mathcal {C}}}. \end{aligned}$$
(54)

Now consider the sequence \(\varvec{\phi }^{(k_j)}\) as \(k_j \rightarrow \infty \). Using the fact that \(\varvec{\phi }^{(k+1)} - \varvec{\phi }^{(k)} \rightarrow \mathbf {0}\), it follows from the continuity of \(\nabla f_{1}(\cdot )\) that \(\nabla f_{1}(\varvec{\phi }^{(k_j+1)}) \rightarrow \nabla f_{1}(\varvec{\phi }^{*})\) as \(k_j \rightarrow \infty \).

Note that \( \partial f_2(\varvec{\phi }^{(k_j)})\) (see (36)) is bounded since \(\varvec{\phi }^{(k_j)} \in [\alpha , \tfrac{1}{\epsilon }]^p\). Passing to a further subsequence \(\{k'_j\}\) if necessary, it follows that \(\partial f_2(\varvec{\phi }^{(k'_j)}) \rightarrow \vartheta \) (say). Using Proposition 10, we conclude that \(\vartheta \) is a subgradient of \(f_2\) evaluated at \(\varvec{\phi }^*\). Letting \(k'_j \rightarrow \infty \), the above argument along with (54) implies that:

$$\begin{aligned} \left\langle \nabla f_{1}(\varvec{\phi }^{*}) - \partial f_{2} ( \varvec{\phi }^{*}), \varvec{\phi }- \varvec{\phi }^{*} \right\rangle \ge 0~~\forall \varvec{\phi }\in {{\mathcal {C}}}, \end{aligned}$$
(55)

where \(\partial f_{2} ( \varvec{\phi }^{*})\) denotes a subgradient of \(f_{2}\) evaluated at \(\varvec{\phi }^{*}\). Thus (55) implies that \(\varvec{\phi }^*\) is a first order stationary point.
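To make the structure of the updates \(\varvec{\phi }^{(k+1)} \in {\hbox {arg min}}_{\varvec{\phi }\in {\mathcal {C}}} F(\varvec{\phi }; \varvec{\phi }^{(k)})\) concrete, the following is a generic difference-of-convex (linearize-and-minimize) loop. It is only a schematic sketch under stated assumptions: the functions f1 and f2 and the box bounds below are hypothetical stand-ins, not the paper's spectral functions or its solver.

    import numpy as np
    from scipy.optimize import minimize

    p, eps = 5, 1e-2
    bounds = [(1e-3, 1.0 / eps)] * p           # stands in for the box constraint C = [alpha, 1/eps]^p

    def f1(phi):                               # hypothetical smooth convex component
        return float(np.sum(phi ** 2))

    def grad_f1(phi):
        return 2.0 * phi

    def subgrad_f2(phi):                       # hypothetical subgradient of the convex component f2
        return np.ones_like(phi)

    phi = np.full(p, 1.0)
    for k in range(1000):
        g = subgrad_f2(phi)                    # linearize f2 at phi^(k)
        # Convex surrogate F(phi; phi^(k)) = f1(phi) - <g, phi> (constants dropped);
        # minimizing it over the box yields the next iterate.
        res = minimize(lambda x: f1(x) - g @ x,
                       phi,
                       jac=lambda x: grad_f1(x) - g,
                       bounds=bounds)
        phi_next = res.x
        # The objective values of f1 - f2 are non-increasing along these iterates,
        # mirroring the descent property used in the proof above.
        if np.max(np.abs(phi_next - phi)) < 1e-8:
            break
        phi = phi_next

As in the proof, the iterates stay in a compact box, so limit points exist, and any limit point satisfies the first order stationarity condition (55).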

1.2 Representing \(\partial {h(\varvec{\Psi },\mathbf {L})}/{\partial \mathbf {L}}=0\)

Note that \(\partial {h(\varvec{\Psi },\mathbf {L})}/{\partial \mathbf {L}}=0\) is equivalent to setting (12) to zero, which leads to \(\mathbf {L}=\mathbf {S}\varvec{\Sigma }^{-1}\mathbf {L}\). By applying the Sherman–Morrison–Woodbury formula to \((\varvec{\Psi }+ \mathbf {L}\mathbf {L}^\top )^{-1}\), we have the following:

$$\begin{aligned} \mathbf {L}&= \mathbf {S}\varvec{\Sigma }^{-1}\mathbf {L} \\&= \mathbf {S}\left( \varvec{\Psi }^{-1} - \varvec{\Psi }^{-1}\mathbf {L}\left( \mathbf {I} + \mathbf {L}^\top \varvec{\Psi }^{-1}\mathbf {L} \right) ^{-1}\mathbf {L}^\top \varvec{\Psi }^{-1} \right) \mathbf {L} \qquad (56) \\&= \mathbf {S}\varvec{\Psi }^{-1}\mathbf {L} - \mathbf {S}\varvec{\Psi }^{-1}\mathbf {L} \left( \left( \mathbf {I} + \mathbf {L}^\top \varvec{\Psi }^{-1}\mathbf {L} \right) ^{-1} \mathbf {L}^\top \varvec{\Psi }^{-1}\mathbf {L} \right) \qquad (57) \\&= \mathbf {S}\varvec{\Psi }^{-1}\mathbf {L} - \mathbf {S}\varvec{\Psi }^{-1}\mathbf {L} \left( \mathbf {I} - \left( \mathbf {I} + \mathbf {L}^\top \varvec{\Psi }^{-1}\mathbf {L} \right) ^{-1} \right) \qquad (58) \\&= \mathbf {S}\varvec{\Psi }^{-1}\mathbf {L}\left( \mathbf {I} + \mathbf {L}^\top \varvec{\Psi }^{-1}\mathbf {L} \right) ^{-1}, \qquad (59) \end{aligned}$$

where Eqn (56) follows from (8), Eqn (57) follows from (56) by expanding the product, and Eqn (58) follows from (57) by using the observation that for a PSD matrix \(\mathbf {B}\), we have the identity \((\mathbf {I}+\mathbf {B})^{-1}\mathbf {B} = \mathbf {I} - (\mathbf {I} + \mathbf {B})^{-1}\) (which can be verified by simple algebra), applied here with \(\mathbf {B} = \mathbf {L}^\top \varvec{\Psi }^{-1}\mathbf {L}\).

Finally, we note that (13) follows by using the definition of \(\mathbf {L}^*\) and \(\mathbf {S}^*\) in (59) and doing some algebraic manipulations.
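As a quick numerical sanity check of the chain (56)–(59) (a sketch added here, with randomly generated \(\varvec{\Psi }\), \(\mathbf {L}\) and \(\mathbf {S}\); the dimensions and variable names are arbitrary), one can verify that \(\mathbf {S}\varvec{\Sigma }^{-1}\mathbf {L} = \mathbf {S}\varvec{\Psi }^{-1}\mathbf {L}(\mathbf {I} + \mathbf {L}^\top \varvec{\Psi }^{-1}\mathbf {L})^{-1}\) holds for any such matrices, since only the Woodbury identity is used between (56) and (59):

    import numpy as np

    rng = np.random.default_rng(0)
    p, r, n = 6, 2, 100
    L = rng.standard_normal((p, r))                         # arbitrary loading matrix
    psi = rng.uniform(0.5, 2.0, size=p)                     # arbitrary positive diagonal of Psi
    Psi_inv = np.diag(1.0 / psi)
    Sigma = np.diag(psi) + L @ L.T
    S = np.cov(rng.standard_normal((n, p)), rowvar=False)   # arbitrary PSD "sample covariance"

    lhs = S @ np.linalg.solve(Sigma, L)                     # S Sigma^{-1} L
    M = np.eye(r) + L.T @ Psi_inv @ L
    rhs = S @ Psi_inv @ L @ np.linalg.inv(M)                # S Psi^{-1} L (I + L^T Psi^{-1} L)^{-1}
    print(np.allclose(lhs, rhs))                            # expected: True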


About this article

Cite this article

Khamaru, K., Mazumder, R. Computation of the maximum likelihood estimator in low-rank factor analysis. Math. Program. 176, 279–310 (2019). https://doi.org/10.1007/s10107-019-01370-7
