# Computation of the maximum likelihood estimator in low-rank factor analysis

## Abstract

Factor analysis is a classical multivariate dimensionality reduction technique widely used in statistics, econometrics and data science. Estimation for factor analysis is often carried out via the maximum likelihood principle, which seeks to maximize the Gaussian likelihood under the assumption that the positive definite covariance matrix can be decomposed as the sum of a low-rank positive semidefinite matrix and a diagonal matrix with nonnegative entries. This leads to a challenging rank-constrained nonconvex optimization problem, for which few reliable computational algorithms are available. We reformulate the low-rank maximum likelihood factor analysis task as a nonlinear, nonsmooth semidefinite optimization problem, study various structural properties of this reformulation, and propose fast and scalable algorithms based on difference-of-convex optimization. Our approach has computational guarantees, scales gracefully to large problems, is applicable to situations where the sample covariance matrix is rank deficient, and adapts to variants of the maximum likelihood problem with additional constraints on the model parameters. Our numerical experiments validate the usefulness of our approach over existing state-of-the-art approaches for maximum likelihood factor analysis.

1. Indeed, $$\varvec{\Psi }\succeq \epsilon \mathbf {I}$$ implies that $$\varvec{\Sigma }\succeq \epsilon \mathbf {I} \succ \mathbf {0}$$. Thus, $$-\log \det (\varvec{\Sigma }^{-1}) \ge p\log (\epsilon )$$ and $$\mathrm {tr}\left( \varvec{\Sigma }^{-1} \mathbf {S}\right) \ge 0$$, which shows that Problem (2) is bounded below. Note that Problem (2) with $$\epsilon =0$$ need not have a finite solution, i.e., the ML solution need not exist. If $$\psi _i \rightarrow \infty$$ for some i, then $${{\mathcal {L}}}(\varvec{\Sigma }) \rightarrow \infty$$; a similar argument applies if $$\mathbf {L}\mathbf {L}^\top$$ becomes unbounded. Thus the infimum of Problem (2) is attained when $$\epsilon >0$$.

2. We have observed this in our experiments; the results are reported in the section on numerical experiments.

3. A function $$g(y_{1}, \ldots , y_{p}) : \mathfrak {R}^{p} \rightarrow \mathfrak {R}$$ is said to be symmetric in its arguments if, for any permutation $$\pi$$ of the indices $$\{1, \ldots , p \}$$, we have $$g(y_{1}, \ldots , y_{p}) = g(y_{\pi (1)}, \ldots , y_{\pi (p)})$$.

4. We note that the objective values are decreasing, i.e., $$f(\varvec{\phi }^{(k+1)}) \le f(\varvec{\phi }^{(k)})$$ for all k (see Proposition 7).

5. We call $$f_{2}(\varvec{\phi })$$ a spectral function as it is a function of the eigenvalues (or spectral values) $$\{\lambda _{i}^*\}_1^p$$.

6. We note that non-differentiability occurs if $$g(y_{(r+1)})=g(y_{(r)})$$.

7. This function is available as part of MATLAB's PRML toolbox: https://www.mathworks.com/matlabcentral/fileexchange/55883-probabilistic-pca-and-factor-analysis?focused=6047050&tab=function.

8. Data available at http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/14cancer.info.

9. Note that additional assumptions may be needed to ensure a unique decomposition of $$\varvec{\Sigma }$$ into its components $$\varvec{\Psi }$$ and $$\varvec{\Theta }=\mathbf {L} \mathbf {L}^\top$$. However, the optimization task is well defined even in the absence of such identifiability constraints.

10. This encourages a conditional independence structure among the variables in $$\mathbf {u}$$ (assuming $$\mathbf {u}$$ follows a multivariate normal distribution).

11. We declare convergence if the relative change in successive objective values is smaller than $$10^{-4}$$.
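The stopping rule in footnote 11 can be sketched as follows; the helper name, the division-by-zero guard and the default tolerance are our own illustrative choices, not part of the paper:

```python
def has_converged(f_prev, f_curr, tol=1e-4):
    """Declare convergence when the relative change in successive
    objective values falls below tol (10^-4 in the experiments)."""
    # Guard against division by (near-)zero objective values.
    denom = max(abs(f_prev), 1e-12)
    return abs(f_curr - f_prev) / denom <= tol
```

In practice this check is applied to the sequence of objective values $$f(\varvec{\phi }^{(k)})$$ produced by the algorithm.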

## References

1. Ahn, M., Pang, J.-S., Xin, J.: Difference-of-convex learning: directional stationarity, optimality, and sparsity. SIAM J. Optim. 27(3), 1637–1665 (2017)

2. Anderson, T.: An Introduction to Multivariate Statistical Analysis, 3rd edn. Wiley, New York (2003)

3. Atchadé, Y.F., Mazumder, R., Chen, J.: Scalable computation of regularized precision matrices via stochastic optimization (2015). arXiv preprint arXiv:1509.00426

4. Bai, J., Li, K.: Statistical analysis of factor models of high dimension. Ann. Stat. 40(1), 436–465 (2012)

5. Bai, J., Ng, S.: Large dimensional factor analysis. Found. Trends Econom. 3(2), 89–163 (2008)

6. Banerjee, O., El Ghaoui, L., d'Aspremont, A.: Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. J. Mach. Learn. Res. 9, 485–516 (2008)

7. Bartholomew, D., Knott, M., Moustaki, I.: Latent Variable Models and Factor Analysis: A Unified Approach. Wiley, London (2011)

8. Bertsimas, D., Copenhaver, M.S., Mazumder, R.: Certifiably optimal low rank factor analysis. J. Mach. Learn. Res. 18(29), 1–53 (2017)

9. Bishop, C.: Pattern Recognition and Machine Learning. Springer, New York (2006)

10. Borwein, J., Lewis, A.: Convex Analysis and Nonlinear Optimization. Springer, New York (2006)

11. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)

12. Brand, M.: Fast low-rank modifications of the thin singular value decomposition. Linear Algebra Appl. 415(1), 20–30 (2006)

13. Davis, C.: All convex invariant functions of Hermitian matrices. Archiv der Mathematik 8(4), 276–278 (1957)

14. Pham Dinh, T., Le Thi, H.A.: Recent advances in DC programming and DCA. In: Transactions on Computational Intelligence XIII, pp. 1–37. Springer, New York (2014)

15. Friedman, J., Hastie, T., Tibshirani, R.: Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9(3), 432–441 (2008)

16. Golub, G., Van Loan, C.: Matrix Computations, vol. 3. JHU Press, Baltimore (2012)

17. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning, Second Edition: Data Mining, Inference, and Prediction (Springer Series in Statistics). Springer, New York (2009)

18. Hiriart-Urruty, J-B.: Generalized differentiability/duality and optimization for problems dealing with differences of convex functions. In: Convexity and duality in optimization, pp. 37–70. Springer, New York (1985)

19. Jöreskog, K.G.: Some contributions to maximum likelihood factor analysis. Psychometrika 32(4), 443–482 (1967)

20. Larsen, R.M.: PROPACK-Software for large and sparse SVD calculations (2004). http://sun.stanford.edu/rmunk/PROPACK

21. Lawley, D.N.: The estimation of factor loadings by the method of maximum likelihood. Proc. R. Soc. Edinb. 60(01), 64–82 (1940)

22. Lawley, D.N.: Some new results in maximum likelihood factor analysis. Proc. R. Soc. Edinb. 67(01), 256–264 (1967)

23. Lawley, D.N., Maxwell, A.E.: Factor Analysis as a Statistical Method, 2nd edn. Butterworth, London (1971)

24. Lewis, A.: Derivatives of spectral functions. Math. Oper. Res. 21(3), 576–588 (1996)

25. Lewis, A.S.: Convex analysis on the hermitian matrices. SIAM J. Optim. 6, 164–177 (1996)

26. Mardia, K., Kent, J., Bibby, J.: Multivariate Analysis. Academic Press, London (1979)

27. Nouiehed, M., Pang, J.-S., Razaviyayn, M.: On the pervasiveness of difference-convexity in optimization and statistics (2017). arXiv preprint arXiv:1704.03535

28. O’Donoghue, B., Chu, E., Parikh, N., Boyd, S.: Conic optimization via operator splitting and homogeneous self-dual embedding. J. Optim. Theory Appl. 169(3), 1042–1068 (2016)

29. Pang, J.-S., Razaviyayn, M., Alvarado, A.: Computing B-stationary points of nonsmooth DC programs. Math. Oper. Res. 42(1), 95–118 (2016)

30. Pham Dinh, T., Ngai, H.V., Le Thi, H.A.: Convergence analysis of DC algorithm for DC programming with subanalytic data (2013) (preprint)

31. Robertson, D., Symons, J.: Maximum likelihood factor analysis with rank-deficient sample covariance matrices. J. Multiv. Anal. 98(4), 813–828 (2007)

32. Rockafellar, R.T., Wets, R.J.-B.: Variational Analysis, vol. 317. Springer, New York (2009)

33. Rubin, D.B., Thayer, D.T.: EM algorithms for ML factor analysis. Psychometrika 47(1), 69–76 (1982)

34. Saunderson, J., Chandrasekaran, V., Parrilo, P., Willsky, A.: Diagonal and low-rank matrix decompositions, correlation matrices, and ellipsoid fitting. SIAM J. Matrix Anal. Appl. 33(4), 1395–1416 (2012)

35. Shapiro, A., Ten Berge, J.: Statistical inference of minimum rank factor analysis. Psychometrika 67, 79–94 (2002)

36. Spearman, C.: “General Intelligence,” objectively determined and measured. Am. J. Psychol. 15, 201–293 (1904)

37. Tuy, H.: DC optimization: theory, methods and algorithms. In: Handbook of Global Optimization, pp. 149–216. Springer, New York (1995)

38. Vangeepuram, S., Lanckriet, G.R.G.: On the convergence of the concave-convex procedure. In: Advances in Neural Information Processing Systems (NIPS), vol. 22. MIT Press (2009)

39. Yuille, A., Rangarajan, A.: The concave-convex procedure (CCCP). Neural Comput. 15, 915–936 (2003)

## Acknowledgements

The authors would like to thank the Editors and the anonymous Reviewers for their helpful feedback and detailed comments that helped improve this manuscript. Rahul Mazumder was partially supported by ONR-N000141512342, ONR-N000141812298 (YIP) and NSF-IIS1718258. The authors would also like to thank the Statistics Department at Columbia University for hosting Koulik Khamaru as a summer intern, when this work started.

## Author information

### Corresponding author

Correspondence to Rahul Mazumder.

## Appendix

### Proposition 10

(See Sect. 6 in ) Suppose the function $$g: \mathbf {E} \rightarrow (-\infty , \infty )$$, with $$\mathbf {E} \subset \mathfrak {R}^{m}$$, is convex, and the point x lies in the interior of dom(g). For $$r \ge 1$$, let $$x^r, x \in \mathbf {E}$$ and let $$\nu ^r$$ be a subgradient of g evaluated at $$x^r$$. If $$x^{r} \rightarrow x$$ and $$\nu ^{r} \rightarrow \nu$$ as $$r \rightarrow \infty$$, then $$\nu$$ is a subgradient of g evaluated at x.
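As a sanity check (our own illustration, not part of the original text), the proposition can be verified numerically for $$g(x) = |x|$$: along the sequence $$x^r = 1/r \rightarrow 0$$, each subgradient equals 1, and the limit 1 is indeed a subgradient of $$|\cdot |$$ at the limit point 0 (i.e., it lies in $$[-1,1]$$).

```python
import numpy as np

g = abs
x_seq = [1.0 / r for r in range(1, 100)]   # x^r -> x = 0
nu_seq = [1.0 for _ in x_seq]              # subgradient of |.| at each x^r > 0
nu = nu_seq[-1]                            # nu^r -> nu = 1

# Check the subgradient inequality at the limit point x = 0:
# g(y) >= g(0) + nu * (y - 0) for all y (here: |y| >= y).
ys = np.linspace(-2.0, 2.0, 401)
ok = all(g(y) >= g(0.0) + nu * (y - 0.0) - 1e-12 for y in ys)
```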

### 1.1 Proof of Proposition 9

Note that the objective function $$f(\varvec{\phi })$$ (see (22)) is unbounded above as $$\phi _{i} \rightarrow 0$$ for any $$i \in [p]$$ (see also Proposition 3). Since the objective values $$f(\varvec{\phi }^{(k)})$$ are nonincreasing in k, this implies that there exists an $$\alpha >0$$ such that $$\varvec{\phi }^{(k)} \in [\alpha , \tfrac{1}{\epsilon }]^p$$ for all k. The boundedness of the sequence $$\varvec{\phi }^{(k)}$$ implies the existence of a limit point of $$\varvec{\phi }^{(k)}$$, say, $$\varvec{\phi }^{*}$$. Let $$\varvec{\phi }^{(k_j)}$$ be a subsequence such that $$\varvec{\phi }^{(k_j)} \rightarrow \varvec{\phi }^{*}$$. Note that for every k, $$\varvec{\phi }^{(k+1)} \in {\hbox {arg min}}_{\varvec{\phi }\in {\mathcal {C}}} F(\varvec{\phi }; \varvec{\phi }^{(k)})$$ is equivalent to

\begin{aligned} \left\langle \nabla f_{1}(\varvec{\phi }^{(k+1)}) - \partial f_{2} ( \varvec{\phi }^{(k)}), \varvec{\phi }- \varvec{\phi }^{(k+1)} \right\rangle \ge 0~~\forall \varvec{\phi }\in {{\mathcal {C}}}. \end{aligned}
(54)

Now consider the sequence $$\varvec{\phi }^{(k_j)}$$ as $$k_j \rightarrow \infty$$. Using the fact that $$\varvec{\phi }^{(k+1)} - \varvec{\phi }^{(k)} \rightarrow \mathbf {0}$$, it follows from the continuity of $$\nabla f_{1}(\cdot )$$ that $$\nabla f_{1}(\varvec{\phi }^{(k_j+1)}) \rightarrow \nabla f_{1}(\varvec{\phi }^{*})$$ as $$k_j \rightarrow \infty$$.

Note that $$\partial f_2(\varvec{\phi }^{(k_j)})$$ (see (36)) is bounded as $$\varvec{\phi }^{(k_j)} \in [\alpha , \tfrac{1}{\epsilon }]^p$$. Passing to a further subsequence $$\{k'_j\}$$ if necessary, it follows that $$\partial f_2(\varvec{\phi }^{(k'_j)}) \rightarrow \vartheta$$ (say). Using Proposition 10, we conclude that $$\vartheta$$ is a subgradient of $$f_2$$ evaluated at $$\varvec{\phi }^*$$. As $$k'_j \rightarrow \infty$$, the above argument along with (54) implies that:

\begin{aligned} \left\langle \nabla f_{1}(\varvec{\phi }^{*}) - \partial f_{2} ( \varvec{\phi }^{*}), \varvec{\phi }- \varvec{\phi }^{*} \right\rangle \ge 0~~\forall \varvec{\phi }\in {{\mathcal {C}}}, \end{aligned}
(55)

where $$\partial f_{2} ( \varvec{\phi }^{*})$$ is a subgradient of $$f_{2}$$ evaluated at $$\varvec{\phi }^{*}$$. Inequality (55) implies that $$\varvec{\phi }^*$$ is a first-order stationary point.
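The monotone-decrease property invoked at the start of this proof can be seen on a toy one-dimensional DC program (our own example, not from the paper): take $$f(x) = f_1(x) - f_2(x)$$ with $$f_1(x) = x^4$$ and $$f_2(x) = 2x^2$$, both convex. Each iteration linearizes $$f_2$$ at the current iterate and minimizes the resulting convex surrogate, which here has a closed-form solution.

```python
import numpy as np

def f(x):
    # DC objective: f1(x) - f2(x) = x^4 - 2x^2
    return x**4 - 2.0 * x**2

def dca_step(x):
    # Minimize the surrogate f1(z) - f2'(x) * z = z^4 - 4*x*z over z:
    # the first-order condition 4 z^3 = 4 x gives z = cbrt(x).
    return np.cbrt(x)

x, vals = 0.3, []
for _ in range(100):
    vals.append(f(x))
    x = dca_step(x)
vals.append(f(x))

# Surrogate minimization guarantees nonincreasing objective values,
# and the iterates approach the stationary point x = 1 with f(1) = -1.
monotone = all(b <= a + 1e-12 for a, b in zip(vals, vals[1:]))
```

The same majorize-then-minimize structure underlies the update $$\varvec{\phi }^{(k+1)} \in {\hbox {arg min}}_{\varvec{\phi }\in {\mathcal {C}}} F(\varvec{\phi }; \varvec{\phi }^{(k)})$$ analyzed above.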

### 1.2 Representing $$\partial {h(\varvec{\Psi },\mathbf {L})}/{\partial \mathbf {L}}=0$$

Note that $$\partial {h(\varvec{\Psi },\mathbf {L})}/{\partial \mathbf {L}}=0$$ is equivalent to setting (12) to zero, which leads to $$\mathbf {L}=\mathbf {S}\varvec{\Sigma }^{-1}\mathbf {L}$$. Applying the Sherman-Morrison-Woodbury formula to $$(\varvec{\Psi }+ \mathbf {L}\mathbf {L}^\top )^{-1}$$, we have the following:

\begin{aligned} \mathbf {L}= & {} \mathbf {S}\varvec{\Sigma }^{-1}\mathbf {L} \nonumber \\= & {} \mathbf {S}\left( \varvec{\Psi }^{-1} - \varvec{\Psi }^{-1}\mathbf {L}\left( \mathbf {I} + \mathbf {L}^\top \varvec{\Psi }^{-1}\mathbf {L} \right) ^{-1}\mathbf {L}^\top \varvec{\Psi }^{-1} \right) \mathbf {L} \end{aligned}
(56)
\begin{aligned}= & {} \mathbf {S}\varvec{\Psi }^{-1}\mathbf {L} - \mathbf {S}\varvec{\Psi }^{-1}\mathbf {L} \left( \left( \mathbf {I} + \mathbf {L}^\top \varvec{\Psi }^{-1}\mathbf {L} \right) ^{-1} \mathbf {L}^\top \varvec{\Psi }^{-1}\mathbf {L} \right) \end{aligned}
(57)
\begin{aligned}= & {} \mathbf {S}\varvec{\Psi }^{-1}\mathbf {L} - \mathbf {S}\varvec{\Psi }^{-1}\mathbf {L} \left( \mathbf {I} - \left( \mathbf {I} + \mathbf {L}^\top \varvec{\Psi }^{-1}\mathbf {L} \right) ^{-1} \right) \end{aligned}
(58)
\begin{aligned}= & {} \mathbf {S}\varvec{\Psi }^{-1}\mathbf {L}\left( \mathbf {I} + \mathbf {L}^\top \varvec{\Psi }^{-1}\mathbf {L} \right) ^{-1}, \end{aligned}
(59)

where Eqn. (56) follows from (8), and Eqn. (58) follows from (57) by using the identity $$(\mathbf {I}+\mathbf {B})^{-1}\mathbf {B} = \mathbf {I} - (\mathbf {I} + \mathbf {B})^{-1}$$, valid for any PSD matrix $$\mathbf {B}$$ (this can be verified by simple algebra).

Finally, we note that (13) follows by using the definition of $$\mathbf {L}^*$$ and $$\mathbf {S}^*$$ in (59) and doing some algebraic manipulations.
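The chain (56)-(59) holds for any positive-definite diagonal $$\varvec{\Psi }$$ and any $$\mathbf {L}$$, so it can be checked numerically on random inputs. The sketch below (with randomly generated $$\mathbf {S}$$, $$\varvec{\Psi }$$, $$\mathbf {L}$$ of our choosing) verifies both the intermediate identity and the final fixed-point form:

```python
import numpy as np

rng = np.random.default_rng(0)
p, r = 6, 2
L = rng.standard_normal((p, r))
S = np.cov(rng.standard_normal((p, 50)))        # any PSD matrix works here
Psi = np.diag(rng.uniform(0.5, 1.5, size=p))    # diagonal with positive entries
Sigma = Psi + L @ L.T
Pinv = np.linalg.inv(Psi)
M = L.T @ Pinv @ L                              # PSD, so I + M is invertible

# Identity behind (58): (I + M)^{-1} M = I - (I + M)^{-1}
id_ok = np.allclose(np.linalg.solve(np.eye(r) + M, M),
                    np.eye(r) - np.linalg.inv(np.eye(r) + M))

# Full chain (56)-(59): S Sigma^{-1} L = S Psi^{-1} L (I + L^T Psi^{-1} L)^{-1}
chain_ok = np.allclose(S @ np.linalg.solve(Sigma, L),
                       S @ Pinv @ L @ np.linalg.inv(np.eye(r) + M))
```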
