Factor analysis models via I-divergence optimization

Given a positive definite covariance matrix Σ̂ of dimension n, we approximate it with a covariance of the form HH⊤ + D, where H has a prescribed number k < n of columns and D > 0 is diagonal. The quality of the approximation is gauged by the I-divergence between the zero mean normal laws with covariances Σ̂ and HH⊤ + D, respectively.
To determine a pair (H, D) that minimizes the I-divergence we construct, by lifting the minimization into a larger space, an iterative alternating minimization algorithm (AML) à la Csiszár–Tusnády. As it turns out, the proper choice of the enlarged space is crucial for optimization. The convergence of the algorithm is studied, with special attention given to the case where D is singular. The theoretical properties of the AML are compared to those of the popular EM algorithm for exploratory factor analysis. Inspired by the ECME (a Newton–Raphson variation on EM), we develop a similar variant of AML, called ACML, and in a few numerical experiments, we compare the performances of the four algorithms.


Introduction
Let Y be a given zero mean normal vector of dimension n and covariance Cov(Y) = Σ. A standard Factor Analysis (FA) model for Y is a linear model

Y = HX + ε,   (1.1)

where H is a deterministic matrix, X is a standard normal vector of dimension k < n, i.e., with zero mean and Cov(X) = I_k (the k-dimensional identity), and ε is a zero mean normal vector, independent of X, with Cov(ε) = D diagonal. The model (1.1) explains the n components of Y as linear combinations of the k < n components of X, perturbed by the independent noise ε.
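As a minimal numerical sketch (the dimensions and parameter values below are ours, chosen only for illustration), the model (1.1) and the covariance relation (1.2) can be checked by simulation:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 5, 2                       # hypothetical dimensions, k < n
H = rng.standard_normal((n, k))   # deterministic loading matrix
d = rng.uniform(0.5, 2.0, n)      # diagonal of D > 0

N = 200_000
X = rng.standard_normal((k, N))                           # Cov(X) = I_k
eps = np.sqrt(d)[:, None] * rng.standard_normal((n, N))   # Cov(eps) = D
Y = H @ X + eps                                           # the FA model (1.1)

emp_cov = Y @ Y.T / N                  # empirical covariance of Y
theo_cov = H @ H.T + np.diag(d)        # the relation (1.2)
print(np.max(np.abs(emp_cov - theo_cov)))   # small, of sampling-error size
```

The agreement improves as N grows, in line with (1.2) holding exactly for the population covariance.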
The built-in linear structure and data reduction mechanism of the FA model make it very attractive in applied research. It is not always possible to describe the given Y with an FA model. Indeed, as a consequence of the hypotheses on X and ε,

Σ = HH⊤ + D,   (1.2)

a relation which imposes strong structural constraints on the covariance Σ. Determining whether the given Y admits an FA model (1.1) requires the solution of an algebraic problem: given Σ, find, if they exist, pairs (H, D) such that (1.2) holds. The structural constraints impose that H must be a tall matrix, and D a diagonal matrix. For a given Σ, the existence and uniqueness of a pair (H, D) are not guaranteed. Generically, the Ledermann bound (Anderson & Rubin, 1956; Ledermann, 1937) gives necessary conditions for the existence of an FA model in terms of k and n. [Correspondence should be made to Peter Spreij, Korteweg-de Vries Institute for Mathematics, Universiteit van Amsterdam, PO Box 94248, 1090 GE Amsterdam, The Netherlands. Email: spreij@uva.nl]
As it turns out, for the data reduction case of this paper, the right tools to deal with the existence and the construction of an FA model are geometric in nature and come from the theory of stochastic realization, see Finesso and Picci (1984) for an early contribution on the subject.
In the present paper we address the problem of constructing an approximate FA model of the given Y. Since in general the relation (1.2) does not hold for any (H, D), one has to find ways to gauge the closeness of Σ to the FA model covariance HH⊤ + D. One possibility is to use a form of least squares as a loss criterion. Here we adopt the I-divergence I(Σ||HH⊤ + D), also known as Kullback-Leibler distance, between the corresponding (multivariate) normal laws. Throughout the paper, Σ is given and assumed to be non-singular.
In statistical inference it is well known, and reviewed in Section 2, that the I-divergence is, up to constants independent of the parameters H and D yielding the covariance HH⊤ + D, the opposite of the normal log likelihood. One has the identity

ℓ(H, D) = −I(Σ̂ || HH⊤ + D) + c,   (1.3)

where Σ̂ is now the empirical covariance matrix, used as an estimator of the true covariance HH⊤ + D, and c is a constant not depending on (H, D). In the empirical context, a non-singular Σ̂ is usually the case if the number of variables is smaller than the number of observations. A completely different situation, singular Σ̂, arises when the number of variables is larger than the number of observations; see e.g., Bai and Li (2012), Trendafilov and Unkel (2011) for recent results.
The choice of the best (H, D) pair can then be posed as a maximum-likelihood problem, as first proposed by Lawley (1940). Lacking a closed form solution, the maximization problem (1.3) has to be attacked numerically, and several optimization algorithms have been either adapted or custom-tailored for it. Among the former, the EM method, introduced in the context of FA estimation by Rubin and Thayer (1982), and still mutating and evolving (Adachi, 2013; Zhao, Yu, & Jiang, 2008; Zhao & Shi, 2014), takes full advantage of the structure of the likelihood in order to guarantee descent at each iteration, although at the expense of a less than ideal convergence rate, which can be slow and sensitive to the initial conditions. It has long been known, see Csiszár and Tusnády (1984), that any EM algorithm can be reformulated by embedding the problem in a properly chosen larger parameter space and then performing alternating partial minimizations of the I-divergence over properly defined subspaces. This setup has previously been followed for various problems, e.g., mixture decomposition (Csiszár & Tusnády, 1984), non-negative matrix factorization (Finesso & Spreij, 2006), and approximation of non-negative impulse response functions (Finesso & Spreij, 2015). The advantage afforded by the embedding procedure is that both partial minimizations have closed form solutions; moreover, a necessary and sufficient condition of optimality of a geometric flavor, a Pythagoras rule, see Csiszár (1975), is available to check optimality for both partial minimizations. As it turns out, and we prove this assertion in Section 5, the EM method proposed in Rubin and Thayer (1982) corresponds to a suboptimal embedding, as one of the Pythagoras rules fails. The main result of this paper is an iterative algorithm, called AML, for the construction of an (H, D) pair minimizing the I-divergence from Σ, using an optimal embedding for which both Pythagoras rules hold.
We also study the behavior of the algorithm in the singular case, i.e., D not of full rank, which is well known to be problematic for FA modeling (Jöreskog, 1967). These theoretical considerations make up the bulk of the paper. We emphasize that the present paper is not on numerical subtleties and (often very clever) improvements as established in the literature to accelerate the convergence of EM type algorithms. Rather, the central feature is the systematic methodology to derive an algorithm by a constructive procedure. Nevertheless, we make a brief foray into the numerical aspects, developing a version of AML, which we call ACML, in the spirit of ECME [a Newton-Raphson variation on EM, Liu and Rubin (1994)].
The remainder of the paper is organized as follows. In Section 2 the approximation problem is posed and discussed, as well as its estimation problem counterpart. Section 3 recasts the problem as a double minimization in a larger space, making it amenable to a solution in terms of alternating minimization. In Section 4, we present the alternating minimization algorithm, provide alternative versions of it, and study its asymptotics. We also point out, in Section 5, the similarities and the differences between our algorithm and the EM algorithm. Section 6 is dedicated to a constrained version of the optimization problem (the singular D case) and the pertinent alternating minimization algorithm. The study of the singular case also sheds light on the boundary limit points of the algorithm presented in Section 4. The last Section 7 is devoted to numerical illustrations, where we compare the performance of the AML, EM, ACML, and ECME algorithms. The Appendix contains the proofs of most of the technical results, and also decomposition results on the I-divergence, which are interesting in their own right, beyond application to Factor Analysis.

Problem Statement
In the present section, we pose the approximation problem and discuss the closely related estimation problem and its statistical counterpart.

Approximation Problem
Consider independent normal, zero mean, random vectors X and ε, of respective dimensions k and n, where k < n, and with Cov(X) = I_k and Cov(ε) = D, a diagonal matrix. For any deterministic conformable matrix H, the n dimensional vector Y given by

Y = HX + ε   (2.1)

is called a standard FA model. The matrices (H, D) are the parameters that identify the model. As a consequence of the hypotheses,

Cov(Y) = HH⊤ + D.   (2.2)

Given an n-dimensional covariance matrix Σ, one can pose the problem of approximating it with the covariance of a standard FA model, i.e., of finding (H, D) such that

Σ ≈ HH⊤ + D.   (2.3)

In this paper, we pose and solve the problem of finding an optimal approximation (2.3) when the criterion of closeness is the I-divergence (also known as Kullback-Leibler distance) between normal laws. Recall that [see e.g., Theorem 1.8.2 in Ihara (1993)] if ν1 and ν2 are two zero mean normal distributions on R^m, whose covariance matrices Σ1 and Σ2, respectively, are both non-singular, the I-divergence I(ν1||ν2) takes the explicit form (|·| denotes determinant)

I(ν1||ν2) = ½ log(|Σ2|/|Σ1|) + ½ tr(Σ2^{-1}Σ1 − I_m).   (2.4)

Since, because of zero means, the I-divergence depends only on the covariance matrices, we usually write I(Σ1||Σ2) instead of I(ν1||ν2). The approximate FA model problem can then be cast as follows.
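The explicit form (2.4) is straightforward to code; a minimal sketch (the function name `i_divergence` is ours):

```python
import numpy as np

def i_divergence(S1, S2):
    """I-divergence I(S1||S2) between zero mean normal laws with
    covariance matrices S1 and S2, both assumed non-singular, as in (2.4)."""
    m = S1.shape[0]
    _, logdet1 = np.linalg.slogdet(S1)
    _, logdet2 = np.linalg.slogdet(S2)
    return 0.5 * (logdet2 - logdet1 + np.trace(np.linalg.solve(S2, S1)) - m)

# sanity checks: I(S||S) = 0 and I-divergence is non-negative
A = np.array([[2.0, 0.3], [0.3, 1.0]])
B = np.eye(2)
print(i_divergence(A, A))        # zero up to rounding
print(i_divergence(A, B) >= 0)   # True
```

Both properties are the Gaussian instances of the general facts that the Kullback-Leibler divergence vanishes iff the two laws coincide and is otherwise positive.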
Problem 2.1. Given the covariance matrix Σ > 0, of size n, and an integer k < n, minimize

I(Σ || HH⊤ + D),

where the minimization is taken over all diagonal matrices D ≥ 0, and H ∈ R^{n×k}.
The first result is that in Problem 2.1, a minimum always exists.
Proposition 2.2. There exist matrices H* ∈ R^{n×k} and a non-negative diagonal D* ∈ R^{n×n} that minimize the I-divergence in Problem 2.1.
Proof. The proof can be found in Finesso and Spreij (2007).

Finesso and Spreij (2006) studied an approximate non-negative matrix factorization (NMF) problem in which the objective function was also of I-divergence type. In that case, using a relaxation technique, the original minimization was lifted to a double minimization in a higher dimensional space, leading naturally to an alternating minimization algorithm. The core of the present paper consists in following the same approach, in the completely different context of covariance matrices, and in solving Problem 2.1 with an alternating minimization algorithm.
As a side remark, note that I(Σ1||Σ2), computed as in (2.4), can be considered as an I-divergence between two positive definite matrices, without referring to normal distributions. Hence the approximation Problem 2.1 is meaningful even without assuming normality.

Estimation Problem
In a statistical setup, the approximation Problem 2.1 has an equivalent formulation as an estimation problem. Let indeed Y_1, ..., Y_N be a sequence of i.i.d. observations, whose distribution is modeled according to (2.1), where the matrices H and D are the unknown parameters. This is the context of Exploratory Factor Analysis, where no constraints are imposed on the matrix H. Let Σ̂ denote the sample covariance matrix of the data. If the data have strictly positive covariance, for large enough N the sample covariance will be strictly positive almost surely. The normal log likelihood ℓ(H, D), normalized per observation and with constants dropped, yields

ℓ(H, D) = −½ log |HH⊤ + D| − ½ tr((HH⊤ + D)^{-1} Σ̂).   (2.6)

One immediately sees that ℓ(H, D) is, up to constants not depending on H and D, equal to −I(Σ̂||HH⊤ + D). Hence, maximum-likelihood estimation parallels I-divergence minimization in Problem 2.1; only the interpretation is different. The equations for the maximum-likelihood estimators can be found in e.g., Section 14.3.1 of Anderson (1984). In terms of the unknown parameters H and D, with D assumed to be nonsingular, they are

Σ̂ D^{-1} H = H (I_k + H⊤ D^{-1} H),   (2.7)
Δ(Σ̂ − HH⊤) = D,   (2.8)

where Δ(M) denotes the diagonal matrix with the same diagonal as M. Equation (2.7) can equivalently be rewritten as

Σ̂ (HH⊤ + D)^{-1} H = H,   (2.9)

which is meaningful also when D is not invertible. It is clear that the system of Equations (2.7), (2.8) does not have an explicit solution. For this reason, several numerical algorithms have been devised, among others a version of the EM algorithm, see Rubin and Thayer (1982). In the present paper we consider an alternative approach, which we will compare with the EM and some of the other algorithms.
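The equivalence between maximum likelihood and I-divergence minimization can be verified numerically: for any two parameter pairs, the quantity ℓ(H, D) + N·I(Σ̂||HH⊤ + D) takes the same value, i.e., the difference is a constant in (H, D). A sketch, under our own (unnormalized) version of the Gaussian log likelihood:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, N = 4, 2, 50
Sig_hat = np.cov(rng.standard_normal((n, N)), bias=True)  # sample covariance

def loglik(H, d):
    """Log likelihood of N zero mean normal samples with sample covariance
    Sig_hat, under model covariance HH^T + D (our normalization)."""
    Sm = H @ H.T + np.diag(d)
    _, logdet = np.linalg.slogdet(Sm)
    return -0.5 * N * (n * np.log(2 * np.pi) + logdet
                       + np.trace(np.linalg.solve(Sm, Sig_hat)))

def idiv(S1, S2):
    _, ld1 = np.linalg.slogdet(S1)
    _, ld2 = np.linalg.slogdet(S2)
    return 0.5 * (ld2 - ld1 + np.trace(np.linalg.solve(S2, S1)) - S1.shape[0])

# two arbitrary parameter pairs give the same value of loglik + N * idiv
vals = []
for seed in (2, 3):
    r = np.random.default_rng(seed)
    H, d = r.standard_normal((n, k)), r.uniform(0.5, 2.0, n)
    vals.append(loglik(H, d) + N * idiv(Sig_hat, H @ H.T + np.diag(d)))
print(vals[0] - vals[1])   # ~0: the difference does not depend on (H, D)
```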

Lifted Version of the Problem
In this section, Problem 2.1 is recast in a higher dimensional space, making it amenable to solution via two partial minimizations which will lead, in Section 4.1, to an alternating minimization algorithm. The original n dimensional FA model (2.1) is embedded in a larger n + k dimensional linear model as follows:

V = (Y; Z), with Y = HX + ε and Z = QX,   (3.1)

where the deterministic matrix Q is square of size k, and for the terms H, X, and ε the hypotheses leading to model (2.1) still hold. The vector V, as well as its subvector Z, is therefore zero mean normal, with

Cov(V) = [ HH⊤ + D   HQ⊤
           QH⊤       QQ⊤ ].   (3.2)

Remark 3.1. The embedding (3.1) has a simple interpretation as a convenient reparametrization of an alternative version (3.3) of the standard FA model (2.1), in which Z and ε are zero mean normal vectors of sizes k and n, respectively, with Cov(Z) = P > 0 and Cov(ε) diagonal. Letting P = QQ⊤ and X = Q^{-1}Z, where Q is any k × k square root of P, one easily recognizes that (3.1) is a reparametrization of (3.3).
In this paper, all vectors are zero mean and normal, with law completely specified by the covariance matrix. The set of all covariance matrices of size n + k will be denoted 𝚺. An element Σ̃ ∈ 𝚺 can always be decomposed as

Σ̃ = [ Σ̃11  Σ̃12
      Σ̃21  Σ̃22 ],   (3.4)

where Σ̃11 and Σ̃22 are square, of respective sizes n and k.

Two subsets of 𝚺, comprising covariance matrices with special structure, will play a major role in what follows. The subset 𝚺0 ⊂ 𝚺 contains all covariances (3.4) with Σ̃11 = Σ, the given matrix, i.e., 𝚺0 = {Σ̃ ∈ 𝚺 : Σ̃11 = Σ}. The generic element of 𝚺0 will often be denoted Σ0. Also of interest is the subset 𝚺1 ⊂ 𝚺, containing all covariances (3.4) of the special form (3.2). The generic elements of 𝚺1 will often be denoted Σ1, or Σ(H, D, Q) when the parameters are relevant.
We are now ready to pose the following double minimization problem.

Problem 3.2. Given the covariance matrix Σ > 0, find

min_{Σ0 ∈ 𝚺0} min_{Σ1 ∈ 𝚺1} I(Σ0 || Σ1).
Problems 3.2 and 2.1 are related by the following proposition.

Proposition 3.3. Let Σ > 0 be given. It holds that

min_{H, D} I(Σ || HH⊤ + D) = min_{Σ0 ∈ 𝚺0} min_{Σ1 ∈ 𝚺1} I(Σ0 || Σ1).
Proof. The proof can be found in Finesso and Spreij (2007).

Partial Minimization Problems
The first partial minimization, required for the solution of Problem 3.2, is as follows.
Problem 3.4. Given a strictly positive definite covariance matrix Σ̃ ∈ 𝚺, find min_{Σ0 ∈ 𝚺0} I(Σ0 || Σ̃).

The unique solution to this problem can be computed analytically and is given below. Next we turn to the second partial minimization.

Problem 3.6. Given a strictly positive definite covariance matrix Σ̃ ∈ 𝚺, find min_{Σ1 ∈ 𝚺1} I(Σ̃ || Σ1).
The proposition below gives the explicit solution to this problem.

Proposition 3.7. With Σ̃ ∈ 𝚺 decomposed as in (3.4), the minimum in Problem 3.6 is attained at Σ1* = Σ(H*, D*, Q*), where Q* is any square root of Σ̃22 (i.e., Q*Q*⊤ = Σ̃22), H* = Σ̃12(Q*⊤)^{-1}, and D* = Δ(Σ̃11 − Σ̃12Σ̃22^{-1}Σ̃21), Δ(·) denoting the diagonal part.
Note that Problem 3.6 cannot have a unique solution in terms of the matrices H and Q. Indeed, if U is a unitary k × k matrix and H̃ = HU, Q̃ = QU, then H̃H̃⊤ = HH⊤, Q̃Q̃⊤ = QQ⊤, and H̃Q̃⊤ = HQ⊤. Nevertheless, the optimal matrices HH⊤, HQ⊤, and QQ⊤ are unique, as can easily be checked using the expressions in Proposition 3.7.
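The non-uniqueness is easy to check numerically; writing the transformation as H → HU, Q → QU with U unitary (a sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 3
H = rng.standard_normal((n, k))
Q = rng.standard_normal((k, k))

# a random unitary (orthogonal) k x k matrix via the QR decomposition
U, _ = np.linalg.qr(rng.standard_normal((k, k)))

H2, Q2 = H @ U, Q @ U
print(np.allclose(H2 @ H2.T, H @ H.T))   # HH^T unchanged
print(np.allclose(Q2 @ Q2.T, Q @ Q.T))   # QQ^T unchanged
print(np.allclose(H2 @ Q2.T, H @ Q.T))   # HQ^T unchanged
```

All three products, and hence the covariance (3.2), are invariant under the rotation, which is why only the products can be identified.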
We close this section by considering a constrained version of the second partial minimization Problem 3.6, to which we will return in Section 5, when we discuss the connection with the EM algorithm. The constraint that we impose is Q = Q0 fixed, whereas H and D remain free. The set over which the optimization will be carried out is 𝚺10 ⊂ 𝚺1, defined as

𝚺10 = {Σ(H, D, Q0) : H ∈ R^{n×k}, D diagonal}.

We pose the following constrained optimization problem.

Problem 3.9. Given a strictly positive definite covariance matrix Σ̃ ∈ 𝚺, find min_{Σ10 ∈ 𝚺10} I(Σ̃ || Σ10).
The solution is given in the next proposition.
Proposition 3.10. A solution Σ10* of Problem 3.9 is obtained for H* = Σ̃12 Σ̃22^{-1} Q0, and D* as in Proposition 3.7, resulting in Σ10* = Σ(H*, D*, Q0).

For the constrained problem, the Pythagorean rule does not hold. Intuitively, since 𝚺10 ⊂ 𝚺1, the optimal value of the constrained Problem 3.9 is in general higher than the optimal value of the free Problem 3.6. To compute the extra cost incurred, notice that Σ10* ∈ 𝚺1; therefore the Pythagorean rule (3.7) gives

I(Σ̃ || Σ10*) = I(Σ̃ || Σ1*) + I(Σ1* || Σ10*),   (3.9)

where Σ1* is as in Proposition 3.7. The quantity I(Σ1*||Σ10*) represents the extra cost. An elementary computation gives

I(Σ1* || Σ10*) = I(Σ̃22 || Q0Q0⊤),   (3.10)

i.e., the optimizing matrices Σ1* and Σ10*, see (3.6), (3.8), coincide iff Q0Q0⊤ = Σ̃22. Summarizing the comments on the constrained problem: (i) the optimal value at the minimum is higher since 𝚺10 ⊂ 𝚺1, (ii) the extra cost is explicitly given by (3.10) as I(Σ̃22||Q0Q0⊤), and (iii) there is no analog of the Pythagorean rule (3.7). The conclusion is that the solution Σ10* of the constrained Problem 3.9 is suboptimal for the free Problem 3.6. The consequences of the suboptimality will be further discussed in Section 5.

Alternating Minimization Algorithm
In this section, the core of the paper, the two partial minimizations of Section 3 are combined into an alternating minimization algorithm for the solution of Problem 2.1. A number of equivalent formulations of the updating equations will be presented and their properties are discussed.

The Algorithm
We suppose that the given covariance matrix Σ is strictly positive definite. To set up the iterative minimization algorithm, assign initial values H_0, D_0, Q_0 to the parameters, with D_0 diagonal, Q_0 invertible, and H_0H_0⊤ + D_0 invertible. The updating rules are constructed as follows. Let H_t, D_t, Q_t be the parameters at the t-th iteration, and Σ1,t = Σ(H_t, D_t, Q_t) the corresponding covariance, defined as in (3.2). Now solve the two partial minimizations as illustrated below.
Here Σ0,t denotes the solution of the first minimization with input Σ1,t, and the second partial minimization, with input Σ0,t, produces Σ1,t+1 = Σ(H_{t+1}, D_{t+1}, Q_{t+1}). To express in a compact form the resulting update equations, define, with Σ_t := H_tH_t⊤ + D_t,

R_t = I_k − H_t⊤ Σ_t^{-1} H_t + H_t⊤ Σ_t^{-1} Σ Σ_t^{-1} H_t.   (4.1)

Note that, by Remark 3.8, H_tH_t⊤ + D_t is actually invertible for all t, since both H_0H_0⊤ + D_0 and Q_0 have been chosen to be invertible. It is easy to show that Q_tR_tQ_t⊤, and consequently R_t, are strictly positive definite and therefore invertible. The update equations resulting from the cascade of the two minimizations are

Q_{t+1}Q_{t+1}⊤ = Q_tR_tQ_t⊤,  H_{t+1} = Σ Σ_t^{-1} H_t Q_t⊤ Q_{t+1}^{-⊤},  D_{t+1} = Δ(Σ − H_{t+1}H_{t+1}⊤).   (4.2)

Properly choosing the square root in Equation (4.2) will make Q_t disappear from the update equations. This is an attractive feature since, at the t-th step of the algorithm, only H_t and D_t are needed to construct the approximation H_tH_t⊤ + D_t. The choice that accomplishes this is

Q_{t+1} = Q_t R_t^{1/2},   (4.3)

with R_t^{1/2} the symmetric square root; substituting (4.3) into (4.2), one gets the AML algorithm.
Algorithm 4.1. (AML) Given H_t, D_t from the t-th step, and R_t as in (4.1), the update equations for an I-divergence minimizing algorithm are

H_{t+1} = Σ (H_tH_t⊤ + D_t)^{-1} H_t R_t^{-1/2},   (4.5)
D_{t+1} = Δ(Σ − H_{t+1}H_{t+1}⊤).   (4.6)

Since R_t only depends on H_t and D_t, see (4.1), the parameter Q_t has been effectively removed from the update equations, although its presence was essential for the derivation.
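As a numerical illustration, here is a minimal sketch of the AML iteration, assuming the updates H_{t+1} = ΣΣ_t^{-1}H_tR_t^{-1/2} and D_{t+1} = Δ(Σ − H_{t+1}H_{t+1}⊤), with R_t as in (4.1) and the symmetric matrix square root; the target Σ and the starting point are ours:

```python
import numpy as np

def inv_sqrt(R):
    """Inverse symmetric square root of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(R)
    return (V / np.sqrt(w)) @ V.T

def idiv(S1, S2):
    _, ld1 = np.linalg.slogdet(S1)
    _, ld2 = np.linalg.slogdet(S2)
    return 0.5 * (ld2 - ld1 + np.trace(np.linalg.solve(S2, S1)) - S1.shape[0])

def aml_step(Sigma, H, d):
    St = H @ H.T + np.diag(d)
    M = np.linalg.solve(St, H)                           # St^{-1} H
    R = np.eye(H.shape[1]) - H.T @ M + M.T @ Sigma @ M   # R_t as in (4.1)
    H_new = Sigma @ M @ inv_sqrt(R)                      # update for H
    d_new = np.diag(Sigma - H_new @ H_new.T)             # diagonal matching
    return H_new, d_new

rng = np.random.default_rng(0)
n, k = 8, 2
A = rng.standard_normal((n, n))
Sigma = A @ A.T + n * np.eye(n)          # a strictly positive definite target
H, d = rng.standard_normal((n, k)), np.ones(n)

divs = []
for _ in range(200):
    divs.append(idiv(Sigma, H @ H.T + np.diag(d)))
    H, d = aml_step(Sigma, H, d)
print(divs[0], divs[-1])   # initial and final divergence
```

Being the superposition of two partial minimizations, each step can only decrease the divergence, which the recorded values confirm.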
Remark 4.2. Algorithm 4.1 has one immediate attractive feature: it preserves at each step the diagonal of Σ. Indeed, if we let Σ_t = H_tH_t⊤ + D_t, then it follows from Equation (4.6) that Δ(Σ_t) = Δ(Σ).
Algorithm 4.1 potentially has two drawbacks making its implementation computationally less attractive. To update H_t via Equation (4.5) one has to compute, at each step, the square root of the k × k matrix R_t and the inverse of the n × n matrix H_tH_t⊤ + D_t. The latter problem may in principle be addressed via the matrix inversion lemma, but this requires an invertible D_t, which could be problematic in practical situations when one encounters a nearly singular D_t. An alternative approach to Algorithm 4.1, avoiding the square roots at each iteration, is to update H̃_t := H_tH_t⊤, via Equation (4.7), and D_t as before.
One can run the update Equation (4.7) for any number T of steps, and then switch back to H_T, taking any n × k factor of H̃_T, i.e., solving H_TH_T⊤ = H̃_T. Since Equation (4.7) transforms H̃_t into H̃_{t+1} preserving the rank, the latter factorization is always possible.
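The final factorization H_TH_T⊤ = H̃_T can be computed, for instance, with a truncated eigendecomposition (a sketch; the name `psd_factor` is ours):

```python
import numpy as np

def psd_factor(Htilde, k):
    """Return an n x k matrix H with H @ H.T = Htilde, assuming Htilde is
    symmetric positive semidefinite of rank at most k."""
    w, V = np.linalg.eigh(Htilde)
    idx = np.argsort(w)[::-1][:k]        # indices of the k largest eigenvalues
    w_top = np.clip(w[idx], 0.0, None)   # guard against tiny negative rounding
    return V[:, idx] * np.sqrt(w_top)    # columns scaled by sqrt(eigenvalues)

rng = np.random.default_rng(0)
n, k = 6, 2
H0 = rng.standard_normal((n, k))
Htilde = H0 @ H0.T                       # a rank-k PSD matrix
H = psd_factor(Htilde, k)
print(np.allclose(H @ H.T, Htilde))      # True: a valid n x k factor
```

Any other factor differs from this one only by a unitary right factor, in line with the non-uniqueness discussed after Problem 3.6.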

Asymptotic Properties
In Proposition 4.4 below, we collect the asymptotic properties of Algorithm 4.1, also quantifying the I-divergence decrease at each step.
(e) Decrease of the objective function: the decrease I(Σ||Σ_t) − I(Σ||Σ_{t+1}) at each step can be expressed as a sum of two I-divergences; here Σ_t = H_tH_t⊤ + D_t is the t-th approximation of Σ, and Σ0,t, Σ1,t were defined in Section 4.1.

Proof. (d) In this case, Equation (4.1) shows that R_t = I, and substituting this into the update equations yields the conclusion. (e) As a matter of fact, we can express the decrease as a sum of two I-divergences, since the algorithm is the superposition of the two partial minimization problems. The results follow from a concatenation of Propositions 3.5 and 3.7. (f) Assume that all variables converge. Then, from (4.3), it follows that Equation (2.9) holds in the limit. This gives the first of the desired relations; the rest is trivial. (g) This follows by inserting the result of (f).
In part (f) of Proposition 4.4, we made the assumption that the limit points (H, D) are interior points. This does not always hold true as it may happen that D contains zeros on the diagonal. See also Section 6.2.
Remark 4.5. Assertions (b) and (c) of Proposition 4.4 agree with the recent results of Adachi (2013) (Lemma 1 and Theorem 1) for the closely related EM algorithm with a strictly positive definite empirical covariance matrix Σ̂. We note that assertions (b) and (c) are automatic consequences of our setup; they follow from the casting of the problem as a double divergence minimization. Indeed, the solutions to the ensuing partial minimization problems are automatically strictly positive definite matrices, as otherwise the minimum divergences would be infinite, which is impossible.

Comparison with the EM Algorithm
In Rubin and Thayer (1982), a version of the EM algorithm (see Dempster, Laird, & Rubin, 1977) has been put forward in the context of estimation for FA models. This algorithm is as follows, with R t as in (4.1).

Algorithm 5.1. (EM) Given H_t, D_t from the t-th step, and R_t as in (4.1), the update equations are

H_{t+1} = Σ Σ_t^{-1} H_t R_t^{-1},
D_{t+1} = Δ(Σ − Σ Σ_t^{-1} H_t H_{t+1}⊤),

with Σ_t = H_tH_t⊤ + D_t.
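A minimal numerical sketch of the EM iteration, assuming the updates H_{t+1} = ΣΣ_t^{-1}H_tR_t^{-1} and D_{t+1} = Δ(Σ − ΣΣ_t^{-1}H_tH_{t+1}⊤); the likelihood ascent of EM appears here as a monotone decrease of the I-divergence:

```python
import numpy as np

def idiv(S1, S2):
    _, ld1 = np.linalg.slogdet(S1)
    _, ld2 = np.linalg.slogdet(S2)
    return 0.5 * (ld2 - ld1 + np.trace(np.linalg.solve(S2, S1)) - S1.shape[0])

def em_step(Sigma, H, d):
    St = H @ H.T + np.diag(d)
    M = np.linalg.solve(St, H)                           # St^{-1} H
    R = np.eye(H.shape[1]) - H.T @ M + M.T @ Sigma @ M   # R_t as in (4.1)
    H_new = Sigma @ M @ np.linalg.inv(R)                 # EM update for H
    d_new = np.diag(Sigma - Sigma @ M @ H_new.T)         # EM update for D
    return H_new, d_new

rng = np.random.default_rng(0)
n, k = 8, 2
A = rng.standard_normal((n, n))
Sigma = A @ A.T + n * np.eye(n)          # a strictly positive definite target
H, d = rng.standard_normal((n, k)), np.ones(n)

divs = []
for _ in range(200):
    divs.append(idiv(Sigma, H @ H.T + np.diag(d)))
    H, d = em_step(Sigma, H, d)
print(divs[0], divs[-1])   # initial and final divergence
```

Note that, unlike (4.6), the D update here does not enforce Δ(Σ_t) = Δ(Σ) exactly at each step.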
The EM Algorithm 5.1 differs in both equations from our AML Algorithm 4.1. It is well known that EM algorithms can be derived as alternating minimizations, see Csiszár and Tusnády (1984); it is therefore interesting to investigate how Algorithm 5.1 can be derived within our framework. To this end, one considers the first partial minimization problem together with the constrained second partial minimization Problem 3.9, the constraint being Q = Q_0, for some Q_0. Later on we will see that the particular choice of Q_0, as long as it is invertible, is irrelevant. The concatenation of these two problems results in the EM Algorithm 5.1, as is detailed below.
Starting at (H_t, D_t, Q_0), one performs the first partial minimization, which results in the matrix Σ0,t given in (5.1). Performing now the constrained second minimization, according to the results of Proposition 3.10, one obtains the updates (5.2). Substitution of (5.1) into (5.2) yields the recursion (5.3). One sees that the matrix Q_0 does not appear in the recursion, just as the matrices Q_t do not occur in Algorithm 4.1, but we have lost the second optimality property enjoyed in the construction of Algorithm 4.1, due to the imposed constraint Q = Q_0. Moreover, the EM algorithm does not enjoy the diagonal preservation property mentioned in Remark 4.2 for Algorithm 4.1, due to the presence of R_t in Equation (5.3). Summarizing, both Algorithms 4.1 and 5.1 are the result of two partial minimization problems. The latter algorithm differs from ours in that the second partial minimization is constrained. In view of the extra cost incurred by the suboptimal constrained minimization, see Equation (3.9), our Algorithm 4.1 yields better performance. We will illustrate these considerations by some numerical examples in Section 7.

Singular D

It has been known for a long time, see e.g., Jöreskog (1967), that numerical solutions to the ML Equations (2.7), (2.8) often produce a nearly singular matrix D. This motivates the analysis of the minimization Problem 2.1 when D is constrained, at the outset, to be singular (Section 6.1), and the investigation of its consequences for the minimization algorithm of Proposition 4.3 (Section 6.2).

Approximation with Singular D
In this section, we consider the approximation Problem 2.1 under the constraint that D has the block diagonal form

D = diag(D1, D2),   (6.1)

with D1 > 0 of size n1 and D2 = 0 of size n2. The constrained minimization problem can be formulated as follows.
Problem 6.1. Given Σ > 0 of size n × n and integers n2 and k, with n2 ≤ k < n, minimize I(Σ||HH⊤ + D) over (H, D), with D satisfying (6.1).
Remark 6.2. Alternatively, in Jöreskog (1967), the solution of the likelihood Equations (2.8) and (2.9) has been investigated under zero constraints on D. In this section, we work directly on the objective function of Problem 6.1.
To reduce the complexity of Problem 6.1 we will now decompose the objective function, choosing a partition of Σ and H conformable with the block structure (6.1) of D; see (6.2). This yields the announced decomposition (6.3) of the objective function.

Proof. The proof follows from Lemma 9.1.
We are now ready to characterize the solution of Problem 6.1. Observe first that the second and third terms on the right-hand side of (6.3) are non-negative and can be made zero; what then remains is a minimization of the form min_D I(·||D) over diagonal D, whose solution is the diagonal part Δ(·) of its first argument. It is remarkable that in this case the minimization problem has an explicit solution.

Algorithm When a Part of D has Zero Diagonal
In Section 6.1, we posed the minimization problem under the additional constraint that the matrix D contains a number of zeros on the diagonal. In the present section, we investigate how this constraint affects the alternating minimization algorithm. For simplicity, we give a detailed account of this using only the recursion (4.7) for H̃_t. Initialize the algorithm at (H_0, D_0), partitioned conformably with (6.1) as in (6.5), where D1,0 > 0, and where H20 ∈ R^{n2×k} is assumed to have full row rank, so that n2 ≤ k. Clearly, H_0H_0⊤ + D_0 is invertible. For H_0 as in Equation (6.5), the resulting recursions are stated in Proposition 6.6, whose proof can be found in Appendix 2.
Note that the recursions of Proposition 6.6 are exactly those that follow from the optimization Problem 6.1. Comparison with (4.7) shows that, while the algorithm for the unconstrained case updates the matrix H̃_t of size n × n, one now needs to update a matrix of the smaller size n1 × n1.
In the special case n2 = k, the matrix in (6.6) is equal to zero. It is remarkable that in this case the algorithm reaches the optimum in one step; the optimal values are computed explicitly in Remark 6.5.

Numerical Comparisons with Other Algorithms
We briefly investigate the numerical performance of our AML Algorithm 4.1, and compare it against the performance of other algorithms. The natural competitor of AML is the EM Algorithm 5.1. After the publication of Rubin and Thayer (1982), the EM algorithm has evolved into a cohort of improved alternatives (Liu & Rubin, 1994, 1998; and more recently Zhao et al., 2008), basically differing from the original EM in numerical implementation aspects. Most notably, in the ECME variant (Liu & Rubin, 1998), H_t is updated as in the EM algorithm, but D_t is updated by direct maximization of the likelihood (equivalently, minimization of the I-divergence) with respect to D, keeping H fixed at the value H_{t+1}. This step cannot be done analytically, and is realized by taking a few Newton-Raphson iterations; Liu and Rubin (1998) suggest that one or two iterations are usually sufficient. The resulting D_{t+1} does not necessarily increase the likelihood with respect to D_t; therefore, a check has to be performed, and possibly the iteration has to be repeated with an adjusted step size. The rationale behind ECME is that the advantage in speed afforded by the direct (along the D parameter) maximization of the likelihood outweighs the drawback of having to check each iteration for actual improvement. We have derived, in the same spirit, a variant of AML retaining the H_t update Equation (4.5) and replacing the D_t update Equation (4.6) with the same Newton-Raphson iterations as in ECME. We named the resulting algorithm ACML. All numerical experiments in this section should be read as comparisons between the performances of AML and ACML versus EM and ECME.
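The direct minimization over D used by ECME and ACML relies on derivatives of the divergence in D. For fixed H, the gradient of D ↦ I(Σ||HH⊤ + D) with respect to the diagonal entries of D is ½·diag(Σ_m^{-1} − Σ_m^{-1}ΣΣ_m^{-1}) with Σ_m = HH⊤ + D, which a finite-difference check confirms (a sketch of the gradient only; the restricted-step Newton iteration itself is omitted):

```python
import numpy as np

def idiv(S1, S2):
    _, ld1 = np.linalg.slogdet(S1)
    _, ld2 = np.linalg.slogdet(S2)
    return 0.5 * (ld2 - ld1 + np.trace(np.linalg.solve(S2, S1)) - S1.shape[0])

rng = np.random.default_rng(0)
n, k = 6, 2
A = rng.standard_normal((n, n))
Sigma = A @ A.T + n * np.eye(n)          # the target covariance (ours)
H = rng.standard_normal((n, k))
d = rng.uniform(1.0, 2.0, n)

# analytic gradient of d -> I(Sigma || HH^T + diag(d))
Sm = H @ H.T + np.diag(d)
Sm_inv = np.linalg.inv(Sm)
grad = 0.5 * np.diag(Sm_inv - Sm_inv @ Sigma @ Sm_inv)

# central finite-difference check, coordinate by coordinate
eps = 1e-6
fd = np.empty(n)
for i in range(n):
    dp, dm = d.copy(), d.copy()
    dp[i] += eps
    dm[i] -= eps
    fd[i] = (idiv(Sigma, H @ H.T + np.diag(dp))
             - idiv(Sigma, H @ H.T + np.diag(dm))) / (2 * eps)
print(np.max(np.abs(grad - fd)))   # agreement to finite-difference accuracy
```

Setting this gradient to zero coordinate-wise is what the Newton-Raphson steps of ECME and ACML approximate.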

Findings
To run the numerical comparisons, we have selected from the published literature five examples of correlation matrices, some of which are well known for being problematic for FA modeling. We have also constructed a sixth data set as an exact FA covariance Σ = HH⊤ + D, selecting randomly the entries of H and D, see below. For each of the six data sets we ran the four algorithms in parallel (sharing the same initial conditions) several times, changing the initial conditions at each run. The figures at the end of the paper are plots of the I-divergence vs. iterations and have been selected to show the typical behaviors of the four algorithms for each data set. The data sets are the following correlation matrices of size n × n. Figure 1: Rubin-Thayer, n = 9, taken from Rubin and Thayer (1982). Figure 2: Maxwell, n = 9, Table 4 in Maxwell (1961), also analyzed as data set 2 in Jöreskog (1967). Figure 3: Rao, n = 9, taken from Rao (1955). Figure 4: Harman, n = 8, Table 5.3 in Harman (1967). Figure 5: Emmett, n = 9, Table I in Emmett (1949), also analyzed as data set 1 in Jöreskog (1967). Figure 6: a randomly generated covariance of the standard FA model type, i.e., Σ = HH⊤ + γD, with n = 20. The elements of H ∈ R^{20×4} and of D ∈ R^{20×20} are samples from a uniform distribution on [1, 10]. For the choice of γ ∈ R_+ see below under (c2).
In all numerical experiments, the number of factors has been fixed at k = 4. Initially it was found that, for a number of runs with different data sets and initial conditions, the ECME algorithm produced negative values for the diagonal matrices D_t, caused by a routine application of the Newton-Raphson (NR) algorithm. The NR routine was subsequently improved, implementing the restricted step version of the NR algorithm for both ECME and ACML. In all ECME and ACML experiments, we have consistently taken 2 steps of the NR algorithm at each iteration. To present the findings, we have grouped the data sets into three groups (a), (b), and (c), within which we observed similar behaviors. Different behaviors are ranked according to their limit divergence and speed of convergence, with priority given to the former.
(a1) Rubin-Thayer data (Figure 1). The graphs of the EM/ECME pair are very similar to those of Liu and Rubin (1998), and we observe that the AML/ACML pair outperforms EM/ECME. The typical ranking for this data set was ACML best, followed by ECME, AML, EM in that order. On a few occasions we observed AML best, followed by EM, ACML, and ECME. ECME was the most sensitive to initial conditions. (a2) Maxwell data (Figure 2). The typical ranking for this data set is as above: ACML, ECME, AML, EM, in decreasing order. For both ECME and ACML we have been able to reproduce Table 5 of Jöreskog for the elements of the D-matrix, and also identified the eighth element of the D-matrix as D8 = 0. On a few occasions ACML and ECME displayed very close behaviors, significantly outperforming AML and EM, whose behaviors were also close to each other. (b1) Rao data (Figure 3). The typical ranking for this data set was ACML, AML, EM, and ECME. Sometimes it took more than 1500 iterations before a significant drop in the divergence of the best performing algorithm could be seen. The entry D1 should be estimated close to zero (Jennrich & Robinson, 1969), which was usually the case for ACML, and for AML and EM with slower convergence. ECME displayed different behaviors (sometimes very good), depending on the initial conditions. (b2) Harman data (Figure 4). The typical ranking for this data set was as above: ACML, AML, EM, ECME. In our runs, ACML performed consistently best, whereas ECME consistently exhibited much larger divergences. For this data set, D2 is known to be zero (Jennrich & Robinson, 1969). All runs of ACML quickly produced D2 = 0, sometimes ECME too, although large deviations have been seen as well. AML and EM exhibited much slower convergence; often 5000 iterations were not enough. (c1) Emmett data (Figure 5). The behavior of the four algorithms for this data set is exceptional.
Very often AML and EM gave faster and better (i.e., to smaller values) convergence than ACML and ECME. (c2) True FA covariance matrix ( Figure 6).
and the selected number of factors is k = 4, this is a perfect modeling situation, with vanishing theoretical minimum divergence. The AML is the best performer, reaching null divergence extremely fast, while the ranking of the other algorithms is sensitive to the value of γ . Figure 6, for γ = 10, shows AML and EM. The pair ACML and ECME has a much worse behavior, which cannot be plotted on the same graph. For γ = 0.1, the ranking of behaviors is different. AML is still the best, immediately followed by ACML, whereas ECME and EM behave erratically and do not converge to zero. We omitted the figure.
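The perfect-model situation of experiment (c2) can be illustrated with a short numerical sketch (ours, not from the paper): we build a covariance of the exact form $HH^\top + D$ and evaluate the I-divergence between the zero mean normal laws, $I(\widehat{\Sigma}\,\|\,\Sigma) = \tfrac{1}{2}\big[\mathrm{tr}(\Sigma^{-1}\widehat{\Sigma}) - n - \log\det(\Sigma^{-1}\widehat{\Sigma})\big]$, which vanishes exactly when the model matches. Dimensions and variable names below are our own choices.

```python
import numpy as np

def i_divergence(sigma_hat, sigma):
    """I-divergence between the zero-mean normals N(0, sigma_hat) and N(0, sigma)."""
    n = sigma.shape[0]
    m = np.linalg.solve(sigma, sigma_hat)      # sigma^{-1} sigma_hat
    _, logdet = np.linalg.slogdet(m)
    return 0.5 * (np.trace(m) - n - logdet)

rng = np.random.default_rng(0)
n, k = 9, 4                                    # k = 4 factors, as in the experiments
H = rng.standard_normal((n, k))
D = np.diag(rng.uniform(0.5, 1.5, size=n))
sigma_hat = H @ H.T + D                        # exact FA covariance: perfect model
# theoretical minimum divergence is zero in the perfect-model case
assert abs(i_divergence(sigma_hat, sigma_hat)) < 1e-10
```

The divergence is strictly positive as soon as the approximating covariance differs from $\widehat{\Sigma}$, which is what the convergence plots of the four algorithms track.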

Conclusions
Given a positive definite covariance matrix $\widehat{\Sigma}$, which may be an empirical covariance, of dimension n, we considered the problem of approximating it with a covariance of the form $HH^\top + D$, where H has a prescribed low number $k < n$ of columns and $D > 0$ is diagonal. We have chosen to gauge the quality of the approximation by the I-divergence between the zero mean normal laws with covariances $\widehat{\Sigma}$ and $HH^\top + D$, respectively. By lifting the minimization problem into a larger space, we have been able to develop, from first principles, an optimal procedure to determine a pair (H, D) that minimizes the I-divergence. As a consequence, the procedure also yields an iterative alternating minimization algorithm (AML) à la Csiszár–Tusnády. As it turns out, the proper choice of the enlarged space is crucial for the optimization. We have obtained a number of theoretical results that are of independent interest. The convergence of the algorithm has also been studied, with special attention given to the case where D is singular. The theoretical properties of the AML have been compared to those of the popular EM algorithm for exploratory factor analysis. Inspired by the ECME (a Newton–Raphson variation on EM), we also developed a similar variant of AML, called ACML, and in a few numerical experiments we compared the performances of the four algorithms. We have seen that usually the ACML algorithm performed best, in particular better than ECME. In some specific experiments, AML was best, and it always outperformed the EM algorithm.
then it is also a version of $P(X \in B \mid Y)$. One can show that a conditional version of the Radon–Nikodym theorem applies and that a conditional Radon–Nikodym derivative $\frac{dP_{X|Y}}{dQ_{X|Y}}$ exists $Q_Y$-almost surely. Moreover, one has the factorization
\[
\frac{dP_{XY}}{dQ_{XY}} = \frac{dP_{X|Y}}{dQ_{X|Y}}\,\frac{dP_Y}{dQ_Y}.
\]
Taking logarithms on both sides, and expectation under $P_{XY}$, yields
\[
I(P_{XY}\|Q_{XY}) = E_{P_{XY}} \log \frac{dP_{X|Y}}{dQ_{X|Y}} + I(P_Y\|Q_Y).
\]
Writing the first term on the right-hand side as $E_{P_Y}\, I(P_{X|Y}\|Q_{X|Y})$ gives the decomposition
\[
I(P_{XY}\|Q_{XY}) = I(P_Y\|Q_Y) + E_{P_Y}\, I(P_{X|Y}\|Q_{X|Y}). \tag{9.3}
\]
The decomposition of Lemma 9.1 is useful when solving I-divergence minimization problems with marginal constraints, like the one considered below.
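As a quick sanity check of the decomposition of Lemma 9.1 (ours, not part of the paper), one can verify the identity numerically for discrete distributions on a finite alphabet, where the conditional laws are just normalized columns of the joint probability table:

```python
import numpy as np

rng = np.random.default_rng(1)
# random joint probability tables on a 3 x 4 alphabet; axis 0 is X, axis 1 is Y
P = rng.random((3, 4)); P /= P.sum()
Q = rng.random((3, 4)); Q /= Q.sum()

def idiv(p, q):
    """I-divergence (Kullback-Leibler) between two pmfs with common support."""
    return float(np.sum(p * np.log(p / q)))

Py, Qy = P.sum(axis=0), Q.sum(axis=0)                  # Y-marginals
chain = idiv(Py, Qy) + sum(
    Py[y] * idiv(P[:, y] / Py[y], Q[:, y] / Qy[y])     # E_{P_Y} I(P_{X|Y} || Q_{X|Y})
    for y in range(4))
assert abs(idiv(P, Q) - chain) < 1e-12                 # chain rule, Eq. (9.3)
```

The same bookkeeping carries over verbatim to densities, with sums replaced by integrals.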
Proposition 9.2. Let $Q_{XY}$ and $P^0_Y$ be given probability distributions of a Euclidean random vector $(X, Y)$, and of its subvector $Y$, respectively. Consider the I-divergence minimization problem
\[
\min_{P \in \mathcal{P}} I(P\|Q_{XY}), \qquad \mathcal{P} = \{P_{XY} : P_Y = P^0_Y\}.
\]
If the marginal $P^0_Y \ll Q_Y$, then the I-divergence is minimized by $P^*_{XY}$ specified by the Radon–Nikodym derivative
\[
\frac{dP^*_{XY}}{dQ_{XY}} = \frac{dP^0_Y}{dQ_Y}, \tag{9.4}
\]
i.e., $P^*_{XY} = Q_{X|Y}\,P^0_Y$. Moreover, the Pythagorean rule holds, i.e., for any other distribution $P \in \mathcal{P}$,
\[
I(P\|Q_{XY}) = I(P\|P^*_{XY}) + I(P^*_{XY}\|Q_{XY}), \tag{9.5}
\]
and one also has
\[
I(P^*_{XY}\|Q_{XY}) = I(P^0_Y\|Q_Y). \tag{9.6}
\]

Proof. The starting point is Equation (9.3), which now takes the form
\[
I(P\|Q_{XY}) = I(P^0_Y\|Q_Y) + E_{P^0_Y}\, I(P_{X|Y}\|Q_{X|Y}).
\]
Since the first term on the right-hand side is fixed, the minimizing $P^*_{XY}$ must satisfy $P^*_{X|Y} = Q_{X|Y}$. It follows that $P^*_{XY} = P^*_{X|Y}\,P^0_Y = Q_{X|Y}\,P^0_Y$, thus verifying (9.4) and (9.6). We finally show that (9.5) holds.
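Proposition 9.2 can likewise be checked numerically on a finite alphabet; in the sketch below (ours), the minimizer $P^* = Q_{X|Y}\,P^0_Y$ is formed by rescaling the columns of $Q$, and the relations (9.5) and (9.6) hold to machine precision:

```python
import numpy as np

rng = np.random.default_rng(2)

def idiv(p, q):
    """I-divergence between two pmfs with common support."""
    return float(np.sum(p * np.log(p / q)))

Q = rng.random((3, 4)); Q /= Q.sum()          # given joint law Q_{XY}; axis 1 is Y
P0y = rng.random(4); P0y /= P0y.sum()         # prescribed Y-marginal P0_Y
Pstar = Q / Q.sum(axis=0) * P0y               # P* = Q_{X|Y} P0_Y, Eq. (9.4)

# any other P with Y-marginal P0_Y
P = rng.random((3, 4)); P = P / P.sum(axis=0) * P0y

assert abs(idiv(Pstar, Q) - idiv(P0y, Q.sum(axis=0))) < 1e-12     # Eq. (9.6)
assert abs(idiv(P, Q) - idiv(P, Pstar) - idiv(Pstar, Q)) < 1e-12  # Eq. (9.5)
```

The column rescaling leaves the conditionals of $X$ given $Y$ untouched, which is exactly the characterization $P^*_{X|Y} = Q_{X|Y}$ obtained in the proof.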
\[
I(P\|Q_{XY}) - I(P\|P^*_{XY}) = E_P \log \frac{dP^*_{XY}}{dQ_{XY}} = E_P \log \frac{dP^0_Y}{dQ_Y} = I(P^0_Y\|Q_Y) = I(P^*_{XY}\|Q_{XY}),
\]
where we used that any $P_{XY} \in \mathcal{P}$ has $Y$-marginal distribution $P^0_Y$.

The results above can be extended to the case where the random vector $(X, Y) := (X, Y_1, \ldots, Y_m)$, i.e., $Y$ consists of $m$ random subvectors $Y_i$. For any probability distribution $P_{XY}$ on $(X, Y)$, consider the conditional distributions $P_{Y_i|X}$ and define the probability distribution $\widetilde{P}_{XY}$ on $(X, Y)$:
\[
\widetilde{P}_{XY} = P_X \prod_{i=1}^m P_{Y_i|X}.
\]
Note that, under $\widetilde{P}_{XY}$, the $Y_i$ are conditionally independent given $X$. The following lemma sharpens Lemma 9.1.

Lemma 9.3. Let $P_{XY}$ and $Q_{XY}$ be given probability distributions of a Euclidean random vector $(X, Y) := (X, Y_1, \ldots, Y_m)$. Assume that $P_{XY} \ll Q_{XY}$ and that, under $Q_{XY}$, the subvectors $Y_i$ of $Y$ are conditionally independent given $X$; then
\[
I(P_{XY}\|Q_{XY}) = I(P_{XY}\|\widetilde{P}_{XY}) + I(P_X\|Q_X) + \sum_{i=1}^m E_{P_X}\, I(P_{Y_i|X}\|Q_{Y_i|X}).
\]
Proof. The proof runs along the same lines as the proof of Lemma 9.1.
The decomposition of Lemma 9.3 is useful when solving I-divergence minimization problems with conditional independence constraints, like the one considered below.
Proposition 9.4. Let $P_{XY}$ be a given probability distribution of a Euclidean random vector $(X, Y) := (X, Y_1, \ldots, Y_m)$. Consider the I-divergence minimization problem
\[
\min_{Q \in \mathcal{Q}} I(P_{XY}\|Q), \qquad \mathcal{Q} = \{Q_{XY} : \text{the } Y_i \text{ are conditionally independent given } X \text{ under } Q_{XY}\}.
\]
If $P_{XY} \ll Q_{XY}$ for some $Q_{XY} \in \mathcal{Q}$, then the I-divergence is minimized by
\[
Q^*_{XY} = \widetilde{P}_{XY}.
\]

Proof. From the right-hand side of the identity in Lemma 9.3, we see that the first I-divergence is not involved in the minimization, whereas the other two can be made equal to zero by selecting $Q_{Y_i|X} = P_{Y_i|X}$ and $Q_X = P_X$. This shows that the minimizing $Q^*_{XY}$ is equal to $\widetilde{P}_{XY}$. To prove the Pythagorean rule, we first observe that trivially
\[
I(P_{XY}\|Q^*_{XY}) = I(P_{XY}\|\widetilde{P}_{XY}). \tag{9.8}
\]
Next we apply the identity in Lemma 9.3 with $Q^*_{XY}$ replacing $P_{XY}$. In this case the corresponding $\widetilde{Q}^*_{XY}$ obviously equals $Q^*_{XY}$ itself. Hence the identity reads
\[
I(Q^*_{XY}\|Q_{XY}) = I(P_X\|Q_X) + \sum_{i=1}^m E_{P_X}\, I(P_{Y_i|X}\|Q_{Y_i|X}), \tag{9.9}
\]
by definition of $Q^*_{XY}$. Adding up Equations (9.8) and (9.9) gives the result.
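The conclusion of Proposition 9.4 also lends itself to a small numerical check (our sketch, finite alphabets, $m = 2$): the minimizer $\widetilde{P}$ is the product of the $X$-marginal and the conditionals $P_{Y_i|X}$, and the Pythagorean rule holds exactly for any $Q$ under which $Y_1$ and $Y_2$ are conditionally independent given $X$:

```python
import numpy as np

rng = np.random.default_rng(3)

def idiv(p, q):
    """I-divergence between two pmfs with common support."""
    return float(np.sum(p * np.log(p / q)))

# joint pmf P on (X, Y1, Y2); axes are (x, y1, y2)
P = rng.random((2, 3, 3)); P /= P.sum()
Px = P.sum(axis=(1, 2))
Py1_x = P.sum(axis=2) / Px[:, None]      # conditionals P_{Y1|X}
Py2_x = P.sum(axis=1) / Px[:, None]      # conditionals P_{Y2|X}
Ptilde = Px[:, None, None] * Py1_x[:, :, None] * Py2_x[:, None, :]  # Q* = P~

# an arbitrary Q with Y1 and Y2 conditionally independent given X
Qx = rng.random(2); Qx /= Qx.sum()
Qy1 = rng.random((2, 3)); Qy1 /= Qy1.sum(axis=1, keepdims=True)
Qy2 = rng.random((2, 3)); Qy2 /= Qy2.sum(axis=1, keepdims=True)
Q = Qx[:, None, None] * Qy1[:, :, None] * Qy2[:, None, :]

# Pythagorean rule: I(P||Q) = I(P||Q*) + I(Q*||Q)
assert abs(idiv(P, Q) - idiv(P, Ptilde) - idiv(Ptilde, Q)) < 1e-12
```

Since $\widetilde{P}$ shares the marginals on $(X)$, $(X, Y_1)$, and $(X, Y_2)$ with $P$, the cross terms in the two divergences on the right-hand side match exactly, which is the content of Lemma 9.3.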

Appendix 2: Proof of the Technical Results
Proof of Proposition 3.5. (First partial minimization). Consider the setup and the notation of Proposition 9.2. Identify $Q$ with the normal $N(0, \Sigma)$, and $P^0$ with $N(0, \widehat{\Sigma})$. By virtue of (9.4), the optimal $P^*$ is a zero mean normal whose covariance matrix $\Sigma^*_0$ can be computed using the properties of conditional normal distributions. To prove that $\Sigma^*_0$ is strictly positive, note first that $\Sigma^*_{11} = \widehat{\Sigma} > 0$ by assumption. To conclude, since $\Sigma > 0$, it is enough to note that the conditional covariance under $P^*$, a Schur complement of $\Sigma$, is strictly positive as well. Finally, the relation $I(\Sigma^*_0\,\|\,\Sigma) = I(\widehat{\Sigma}\,\|\,\Sigma_{11})$ is Equation (9.6) adapted to the present situation. The Pythagorean rule follows from this relation and Equation (9.7).

Proof of Proposition 3.7. (Second partial minimization). We adhere to the setting and the notation of Proposition 9.4. Identify $P = P_{XY}$ with the normal distribution $N(0, \Sigma)$ and $Q = Q_{XY}$ with the normal $N(0, \Sigma_1)$, where $\Sigma_1 \in \mathbf{\Sigma}_1$. The optimal $Q^* = Q^*_{XY}$ is again normal and specified by its (conditional) mean and covariance matrix. Since $Q^*_{Y_i|X} = P_{Y_i|X}$ for all $i$, we have $E_{Q^*}[Y|X] = E_P[Y|X] = \Sigma_{12}\Sigma_{22}^{-1}X$; moreover, $Q^*_X = P_X$. Furthermore, under $Q^*$, the $Y_i$ are conditionally independent given $X$. Then $\mathrm{Cov}_{Q^*}(Y_i, Y_j|X) = 0$ for $i \neq j$, whereas $\mathrm{Var}_{Q^*}(Y_i|X) = \mathrm{Var}_P(Y_i|X)$, which is the $ii$-element of $\widetilde{\Sigma}_{11} := \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$, from which it follows that $\mathrm{Cov}_{Q^*}(Y|X) = \Delta(\widetilde{\Sigma}_{11})$, the diagonal part of $\widetilde{\Sigma}_{11}$. The Pythagorean rule follows from the general result of Proposition 9.4.
Proof of Proposition 3.10. (Constrained second partial minimization). We can still apply Lemma 9.3 and Proposition 9.4, with the proviso that the marginal distribution of $X$ is fixed at some $Q^0_X$. The optimal distribution will therefore take the form $Q^*_{XY} = \prod_i P_{Y_i|X}\, Q^0_X$. Turning to the explicit computation of the optimal normal law, inspection of the proof of Proposition 3.7 reveals that under $Q^*$ we have $E_{Q^*}[Y|X] = \Sigma_{12}\Sigma_{22}^{-1}X$. The key step in the proof is an application of the elementary identity
\[
H(I + PH)^{-1} = (I + HP)^{-1}H,
\]
see, e.g., Exercise 16(h) of Chapter 5 in Searle (1982), valid for all $H$ and $P$ of appropriate dimensions for which both inverses exist. We have already seen that $R_t$ is invertible and of the type $I + H^\top P H$. Following this recipe, we compute the required inverse; insertion of this result into (10.1) yields (4.7).
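The push-through identity invoked above is easy to verify numerically (our sketch; random matrices of compatible dimensions, which are generically such that both inverses exist):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 5, 3
H = rng.standard_normal((n, k))   # H is n x k
P = rng.standard_normal((k, n))   # P is k x n

# H (I_k + P H)^{-1}  ==  (I_n + H P)^{-1} H
lhs = H @ np.linalg.inv(np.eye(k) + P @ H)
rhs = np.linalg.inv(np.eye(n) + H @ P) @ H
assert np.allclose(lhs, rhs)
```

The practical payoff, as in the proof above, is that a large $n \times n$ inverse can be traded for a small $k \times k$ one when $k < n$.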
Proof of Proposition 6.6. (Update rule for H t , singular case). It is sufficient to show this for one iteration. We start from Equation (4.7) with t = 0 and compute the value of H 1 . To that end we first obtain under the present assumption an expression for the matrix (