# Computation of the maximum likelihood estimator in low-rank factor analysis

## Abstract

Factor analysis is a classical multivariate dimensionality reduction technique widely used in statistics, econometrics and data science. Estimation for factor analysis is often carried out via the maximum likelihood principle, which seeks to maximize the Gaussian likelihood under the assumption that the positive definite covariance matrix can be decomposed as the sum of a low-rank positive semidefinite matrix and a diagonal matrix with nonnegative entries. This leads to a challenging rank-constrained nonconvex optimization problem, for which few reliable computational algorithms are available. We reformulate the low-rank maximum likelihood factor analysis task as a nonlinear, nonsmooth semidefinite optimization problem, study various structural properties of this reformulation, and propose fast and scalable algorithms based on difference-of-convex optimization. Our approach has computational guarantees, scales gracefully to large problems, is applicable to situations where the sample covariance matrix is rank deficient, and adapts to variants of the maximum likelihood problem with additional constraints on the model parameters. Our numerical experiments validate the usefulness of our approach over existing state-of-the-art approaches for maximum likelihood factor analysis.

1. Indeed, $$\varvec{\Psi }\succeq \epsilon \mathbf {I}$$ implies that $$\varvec{\Sigma }\succeq \epsilon \mathbf {I} \succ \mathbf {0}$$. Thus, $$-\log \det (\varvec{\Sigma }^{-1}) \ge p\log (\epsilon )$$ and $$\mathrm {tr}\left( \varvec{\Sigma }^{-1} \mathbf {S}\right) \ge 0$$, which shows that Problem (2) is bounded below. Note that Problem (2) with $$\epsilon =0$$ need not have a finite solution, i.e., the ML solution need not exist. If $$\psi _i \rightarrow \infty$$ for some i, then $${{\mathcal {L}}}(\varvec{\Sigma }) \rightarrow \infty$$; a similar argument applies if $$\mathbf {L}\mathbf {L}^\top$$ becomes unbounded. Thus the infimum of Problem (2) is attained when $$\epsilon >0$$.

2. We have observed this in our experiments; the results are reported in the section on numerical experiments.

3. A function $$g(y_{1}, \ldots , y_{p}) : \mathfrak {R}^{p} \rightarrow \mathfrak {R}$$ is said to be symmetric in its arguments if, for any permutation $$\pi$$ of the indices $$\{1, \ldots , p \}$$, we have $$g(y_{1}, \ldots , y_{p}) = g(y_{\pi (1)}, \ldots , y_{\pi (p)})$$.

4. We note that the objective values are decreasing, i.e., $$f(\varvec{\phi }^{(k+1)}) \le f(\varvec{\phi }^{(k)})$$ for all k (see Proposition 7).

5. We call $$f_{2}(\varvec{\phi })$$ a spectral function as it is a function of the eigenvalues (or spectral values) $$\{\lambda _{i}^*\}_1^p$$.

6. We note that non-differentiability occurs if $$g(y_{(r+1)})=g(y_{(r)})$$.

7. This function is available as part of MATLAB's PRML toolbox: https://www.mathworks.com/matlabcentral/fileexchange/55883-probabilistic-pca-and-factor-analysis?focused=6047050&tab=function.

8. Data available at http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/14cancer.info.

9. Note that additional assumptions may be needed to ensure a unique decomposition of $$\varvec{\Sigma }$$ into its components $$\varvec{\Psi }$$ and $$\varvec{\Theta }=\mathbf {L} \mathbf {L}^\top$$. However, the optimization task is well defined even in the absence of such identifiability constraints.

10. This encourages a conditional independence structure among the variables in $$\mathbf {u}$$ (assuming $$\mathbf {u}$$ follows a multivariate normal distribution).

11. We declare convergence if the relative change in successive objective values is smaller than $$10^{-4}$$.
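The stopping rule in footnote 11 can be sketched as follows; the helper name, the division-by-zero guard and the default tolerance are our own illustrative choices, not part of the paper:

```python
def has_converged(f_prev, f_curr, tol=1e-4):
    """Declare convergence when the relative change in successive
    objective values falls below tol (10^-4 in the experiments)."""
    # Guard against division by (near-)zero objective values.
    denom = max(abs(f_prev), 1e-12)
    return abs(f_curr - f_prev) / denom <= tol
```

In practice this check is applied to the sequence of objective values $$f(\varvec{\phi }^{(k)})$$ produced by the algorithm.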

## References

1. Ahn, M., Pang, J.-S., Xin, J.: Difference-of-convex learning: directional stationarity, optimality, and sparsity. SIAM J. Optim. 27(3), 1637–1665 (2017)

2. Anderson, T.: An Introduction to Multivariate Statistical Analysis, 3rd edn. Wiley, New York (2003)

3. Atchadé, Y.F., Mazumder, R., Chen, J.: Scalable computation of regularized precision matrices via stochastic optimization (2015). arXiv preprint arXiv:1509.00426

4. Bai, J., Li, K.: Statistical analysis of factor models of high dimension. Ann. Stat. 40(1), 436–465 (2012)

5. Bai, J., Ng, S.: Large dimensional factor analysis. Found. Trends Econom. 3(2), 89–163 (2008)

6. Banerjee, O., El Ghaoui, L., d'Aspremont, A.: Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. J. Mach. Learn. Res. 9, 485–516 (2008)

7. Bartholomew, D., Knott, M., Moustaki, I.: Latent Variable Models and Factor Analysis: A Unified Approach. Wiley, London (2011)

8. Bertsimas, D., Copenhaver, M.S., Mazumder, R.: Certifiably optimal low rank factor analysis. J. Mach. Learn. Res. 18(29), 1–53 (2017)

9. Bishop, C.: Pattern Recognition and Machine Learning. Springer, New York (2006)

10. Borwein, J., Lewis, A.: Convex Analysis and Nonlinear Optimization. Springer, New York (2006)

11. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)

12. Brand, M.: Fast low-rank modifications of the thin singular value decomposition. Linear Algebra Appl. 415(1), 20–30 (2006)

13. Davis, C.: All convex invariant functions of Hermitian matrices. Archiv der Mathematik 8(4), 276–278 (1957)

14. Pham Dinh, T., Le Thi, H.A.: Recent advances in DC programming and DCA. In: Transactions on Computational Intelligence XIII, pp. 1–37. Springer, New York (2014)

15. Friedman, J., Hastie, T., Tibshirani, R.: Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9(3), 432–441 (2008)

16. Golub, G., Van Loan, C.: Matrix Computations, vol. 3. JHU Press, Baltimore (2012)

17. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning, Second Edition: Data Mining, Inference, and Prediction (Springer Series in Statistics). Springer, New York (2009)

18. Hiriart-Urruty, J-B.: Generalized differentiability/duality and optimization for problems dealing with differences of convex functions. In: Convexity and duality in optimization, pp. 37–70. Springer, New York (1985)

19. Jöreskog, K.G.: Some contributions to maximum likelihood factor analysis. Psychometrika 32(4), 443–482 (1967)

20. Larsen, R.M.: PROPACK-Software for large and sparse SVD calculations (2004). http://sun.stanford.edu/rmunk/PROPACK

21. Lawley, D.N.: The estimation of factor loadings by the method of maximum likelihood. Proc. R. Soc. Edinb. 60(01), 64–82 (1940)

22. Lawley, D.N.: Some new results in maximum likelihood factor analysis. Proc. R. Soc. Edinb. 67(01), 256–264 (1967)

23. Lawley, D.N., Maxwell, A.E.: Factor Analysis as a Statistical Method, 2nd edn. Butterworth, London (1971)

24. Lewis, A.: Derivatives of spectral functions. Math. Oper. Res. 21(3), 576–588 (1996)

25. Lewis, A.S.: Convex analysis on the hermitian matrices. SIAM J. Optim. 6, 164–177 (1996)

26. Mardia, K., Kent, J., Bibby, J.: Multivariate Analysis. Academic Press, London (1979)

27. Nouiehed, M., Pang, J.-S., Razaviyayn, M.: On the pervasiveness of difference-convexity in optimization and statistics (2017). arXiv preprint arXiv:1704.03535

28. O’Donoghue, B., Chu, E., Parikh, N., Boyd, S.: Conic optimization via operator splitting and homogeneous self-dual embedding. J. Optim. Theory Appl. 169(3), 1042–1068 (2016)

29. Pang, J.-S., Razaviyayn, M., Alvarado, A.: Computing B-stationary points of nonsmooth DC programs. Math. Oper. Res. 42(1), 95–118 (2016)

30. Pham Dinh, T., Ngai, H.V., Le Thi, H.A.: Convergence analysis of DC algorithm for DC programming with subanalytic data (2013) (preprint)

31. Robertson, D., Symons, J.: Maximum likelihood factor analysis with rank-deficient sample covariance matrices. J. Multiv. Anal. 98(4), 813–828 (2007)

32. Rockafellar, R.T., Wets, R.J.-B.: Variational Analysis, vol. 317. Springer, New York (2009)

33. Rubin, D.B., Thayer, D.T.: EM algorithms for ML factor analysis. Psychometrika 47(1), 69–76 (1982)

34. Saunderson, J., Chandrasekaran, V., Parrilo, P., Willsky, A.: Diagonal and low-rank matrix decompositions, correlation matrices, and ellipsoid fitting. SIAM J. Matrix Anal. Appl. 33(4), 1395–1416 (2012)

35. Shapiro, A., Ten Berge, J.: Statistical inference of minimum rank factor analysis. Psychometrika 67, 79–94 (2002)

36. Spearman, C.: “General Intelligence,” objectively determined and measured. Am. J. Psychol. 15, 201–293 (1904)

37. Tuy, H.: DC optimization: theory, methods and algorithms. In: Handbook of Global Optimization, pp. 149–216. Springer, New York (1995)

38. Vangeepuram, S., Lanckriet, G.R.G.: On the convergence of the concave-convex procedure. In: Advances in Neural Information Processing Systems (NIPS), vol. 22. MIT Press (2009)

39. Yuille, A., Rangarajan, A.: The concave-convex procedure (CCCP). Neural Comput. 15, 915–936 (2003)

## Acknowledgements

The authors would like to thank the Editors and the anonymous Reviewers for their helpful feedback and detailed comments that helped improve this manuscript. Rahul Mazumder was partially supported by ONR-N000141512342, ONR-N000141812298 (YIP) and NSF-IIS1718258. The authors would also like to thank the Statistics Department at Columbia University for hosting Koulik Khamaru as a summer intern, when this work started.

## Author information

### Corresponding author

Correspondence to Rahul Mazumder.

## Appendix

### Proposition 10

(See Sect. 6 in ) Suppose the function $$g: \mathbf {E} \rightarrow (-\infty , \infty )$$, with $$\mathbf {E} \subset \mathfrak {R}^{m}$$, is convex, and the point x lies in the interior of dom(g). For $$r \ge 1$$, let $$x^r, x \in \mathbf {E}$$ and let $$\nu ^r$$ be a subgradient of g evaluated at $$x^r$$. If $$x^{r} \rightarrow x$$ and $$\nu ^{r} \rightarrow \nu$$ as $$r \rightarrow \infty$$, then $$\nu$$ is a subgradient of g evaluated at x.
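As a sanity check (our own illustration, not part of the original text), the proposition can be verified numerically for $$g(x) = |x|$$: along the sequence $$x^r = 1/r \rightarrow 0$$, each subgradient equals 1, and the limit 1 is indeed a subgradient of $$|\cdot |$$ at the limit point 0 (i.e., it lies in $$[-1,1]$$).

```python
import numpy as np

g = abs
x_seq = [1.0 / r for r in range(1, 100)]   # x^r -> x = 0
nu_seq = [1.0 for _ in x_seq]              # subgradient of |.| at each x^r > 0
nu = nu_seq[-1]                            # nu^r -> nu = 1

# Check the subgradient inequality at the limit point x = 0:
# g(y) >= g(0) + nu * (y - 0) for all y (here: |y| >= y).
ys = np.linspace(-2.0, 2.0, 401)
ok = all(g(y) >= g(0.0) + nu * (y - 0.0) - 1e-12 for y in ys)
```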

### 1.1 Proof of Proposition 9

Note that the objective function $$f(\varvec{\phi })$$ (see (22)) is unbounded above as $$\phi _{i} \rightarrow 0$$ for any $$i \in [p]$$ (see also Proposition 3). Since the objective values $$f(\varvec{\phi }^{(k)})$$ are nonincreasing in k, this implies that there exists an $$\alpha >0$$ such that $$\varvec{\phi }^{(k)} \in [\alpha , \tfrac{1}{\epsilon }]^p$$ for all k. The boundedness of the sequence $$\varvec{\phi }^{(k)}$$ implies the existence of a limit point of $$\varvec{\phi }^{(k)}$$, say, $$\varvec{\phi }^{*}$$. Let $$\varvec{\phi }^{(k_j)}$$ be a subsequence such that $$\varvec{\phi }^{(k_j)} \rightarrow \varvec{\phi }^{*}$$. Note that for every k, $$\varvec{\phi }^{(k+1)} \in {\hbox {arg min}}_{\varvec{\phi }\in {\mathcal {C}}} F(\varvec{\phi }; \varvec{\phi }^{(k)})$$ is equivalent to

\begin{aligned} \left\langle \nabla f_{1}(\varvec{\phi }^{(k+1)}) - \partial f_{2} ( \varvec{\phi }^{(k)}), \varvec{\phi }- \varvec{\phi }^{(k+1)} \right\rangle \ge 0~~\forall \varvec{\phi }\in {{\mathcal {C}}}. \end{aligned}
(54)

Now consider the sequence $$\varvec{\phi }^{(k_j)}$$ as $$k_j \rightarrow \infty$$. Using the fact that $$\varvec{\phi }^{(k+1)} - \varvec{\phi }^{(k)} \rightarrow \mathbf {0}$$, it follows from the continuity of $$\nabla f_{1}(\cdot )$$ that $$\nabla f_{1}(\varvec{\phi }^{(k_j+1)}) \rightarrow \nabla f_{1}(\varvec{\phi }^{*})$$ as $$k_j \rightarrow \infty$$.

Note that $$\partial f_2(\varvec{\phi }^{(k_j)})$$ (see (36)) is bounded as $$\varvec{\phi }^{(k_j)} \in [\alpha , \tfrac{1}{\epsilon }]^p$$. Passing to a further subsequence $$\{k'_j\}$$ if necessary, it follows that $$\partial f_2(\varvec{\phi }^{(k'_j)}) \rightarrow \vartheta$$ (say). Using Proposition 10, we conclude that $$\vartheta$$ is a subgradient of $$f_2$$ evaluated at $$\varvec{\phi }^*$$. As $$k'_j \rightarrow \infty$$, the above argument along with (54) implies that:

\begin{aligned} \left\langle \nabla f_{1}(\varvec{\phi }^{*}) - \partial f_{2} ( \varvec{\phi }^{*}), \varvec{\phi }- \varvec{\phi }^{*} \right\rangle \ge 0~~\forall \varvec{\phi }\in {{\mathcal {C}}}, \end{aligned}
(55)

where $$\partial f_{2} ( \varvec{\phi }^{*})$$ is a subgradient of $$f_{2}$$ evaluated at $$\varvec{\phi }^{*}$$. Inequality (55) implies that $$\varvec{\phi }^*$$ is a first-order stationary point.
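The monotone-decrease property invoked at the start of this proof can be seen on a toy one-dimensional DC program (our own example, not from the paper): take $$f(x) = f_1(x) - f_2(x)$$ with $$f_1(x) = x^4$$ and $$f_2(x) = 2x^2$$, both convex. Each iteration linearizes $$f_2$$ at the current iterate and minimizes the resulting convex surrogate, which here has a closed-form solution.

```python
import numpy as np

def f(x):
    # DC objective: f1(x) - f2(x) = x^4 - 2x^2
    return x**4 - 2.0 * x**2

def dca_step(x):
    # Minimize the surrogate f1(z) - f2'(x) * z = z^4 - 4*x*z over z:
    # the first-order condition 4 z^3 = 4 x gives z = cbrt(x).
    return np.cbrt(x)

x, vals = 0.3, []
for _ in range(100):
    vals.append(f(x))
    x = dca_step(x)
vals.append(f(x))

# Surrogate minimization guarantees nonincreasing objective values,
# and the iterates approach the stationary point x = 1 with f(1) = -1.
monotone = all(b <= a + 1e-12 for a, b in zip(vals, vals[1:]))
```

The same majorize-then-minimize structure underlies the update $$\varvec{\phi }^{(k+1)} \in {\hbox {arg min}}_{\varvec{\phi }\in {\mathcal {C}}} F(\varvec{\phi }; \varvec{\phi }^{(k)})$$ analyzed above.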

### 1.2 Representing $$\partial {h(\varvec{\Psi },\mathbf {L})}/{\partial \mathbf {L}}=0$$

Note that $$\partial {h(\varvec{\Psi },\mathbf {L})}/{\partial \mathbf {L}}=0$$ is equivalent to setting (12) to zero, which leads to $$\mathbf {L}=\mathbf {S}\varvec{\Sigma }^{-1}\mathbf {L}$$. Applying the Sherman-Morrison-Woodbury formula to $$(\varvec{\Psi }+ \mathbf {L}\mathbf {L}^\top )^{-1}$$, we have the following:

\begin{aligned} \mathbf {L}= & {} \mathbf {S}\varvec{\Sigma }^{-1}\mathbf {L} \nonumber \\= & {} \mathbf {S}\left( \varvec{\Psi }^{-1} - \varvec{\Psi }^{-1}\mathbf {L}\left( \mathbf {I} + \mathbf {L}^\top \varvec{\Psi }^{-1}\mathbf {L} \right) ^{-1}\mathbf {L}^\top \varvec{\Psi }^{-1} \right) \mathbf {L} \end{aligned}
(56)
\begin{aligned}= & {} \mathbf {S}\varvec{\Psi }^{-1}\mathbf {L} - \mathbf {S}\varvec{\Psi }^{-1}\mathbf {L} \left( \left( \mathbf {I} + \mathbf {L}^\top \varvec{\Psi }^{-1}\mathbf {L} \right) ^{-1} \mathbf {L}^\top \varvec{\Psi }^{-1}\mathbf {L} \right) \end{aligned}
(57)
\begin{aligned}= & {} \mathbf {S}\varvec{\Psi }^{-1}\mathbf {L} - \mathbf {S}\varvec{\Psi }^{-1}\mathbf {L} \left( \mathbf {I} - \left( \mathbf {I} + \mathbf {L}^\top \varvec{\Psi }^{-1}\mathbf {L} \right) ^{-1} \right) \end{aligned}
(58)
\begin{aligned}= & {} \mathbf {S}\varvec{\Psi }^{-1}\mathbf {L}\left( \mathbf {I} + \mathbf {L}^\top \varvec{\Psi }^{-1}\mathbf {L} \right) ^{-1}, \end{aligned}
(59)

where Eqn. (56) follows from (8), and Eqn. (58) follows from (57) by using the identity $$(\mathbf {I}+\mathbf {B})^{-1}\mathbf {B} = \mathbf {I} - (\mathbf {I} + \mathbf {B})^{-1}$$, valid for any PSD matrix $$\mathbf {B}$$ (this can be verified by simple algebra).

Finally, we note that (13) follows by using the definition of $$\mathbf {L}^*$$ and $$\mathbf {S}^*$$ in (59) and doing some algebraic manipulations.
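The chain (56)-(59) holds for any positive-definite diagonal $$\varvec{\Psi }$$ and any $$\mathbf {L}$$, so it can be checked numerically on random inputs. The sketch below (with randomly generated $$\mathbf {S}$$, $$\varvec{\Psi }$$, $$\mathbf {L}$$ of our choosing) verifies both the intermediate identity and the final fixed-point form:

```python
import numpy as np

rng = np.random.default_rng(0)
p, r = 6, 2
L = rng.standard_normal((p, r))
S = np.cov(rng.standard_normal((p, 50)))        # any PSD matrix works here
Psi = np.diag(rng.uniform(0.5, 1.5, size=p))    # diagonal with positive entries
Sigma = Psi + L @ L.T
Pinv = np.linalg.inv(Psi)
M = L.T @ Pinv @ L                              # PSD, so I + M is invertible

# Identity behind (58): (I + M)^{-1} M = I - (I + M)^{-1}
id_ok = np.allclose(np.linalg.solve(np.eye(r) + M, M),
                    np.eye(r) - np.linalg.inv(np.eye(r) + M))

# Full chain (56)-(59): S Sigma^{-1} L = S Psi^{-1} L (I + L^T Psi^{-1} L)^{-1}
chain_ok = np.allclose(S @ np.linalg.solve(Sigma, L),
                       S @ Pinv @ L @ np.linalg.inv(np.eye(r) + M))
```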
