A Note on Likelihood Ratio Tests for Models with Latent Variables

The likelihood ratio test (LRT) is widely used for comparing the relative fit of nested latent variable models. Following Wilks' theorem, the LRT is conducted by comparing the LRT statistic with its asymptotic distribution under the restricted model, a $\chi^2$-distribution with degrees of freedom equal to the difference in the number of free parameters between the two nested models under comparison. For models with latent variables such as factor analysis, structural equation models and random effects models, however, it is often found that the $\chi^2$ approximation does not hold. In this note, we show how the regularity conditions of Wilks' theorem may be violated using three examples of models with latent variables. In addition, a more general theory for the LRT is given that provides the correct asymptotic theory for these LRTs. This general theory was first established in Chernoff (1954) and discussed in both van der Vaart (2000) and Drton (2009), but it does not seem to have received enough attention. We illustrate this general theory with the three examples.

and Hayashi et al. (2007) were among those who pointed out that models for which one needs to select dimensionality (e.g. principal component analysis, latent class, factor models) have points of irregularity in their parameter space that in some cases invalidate the use of the LRT. Specifically, such issues arise in factor analysis when comparing models with different numbers of factors rather than comparing a factor model against the saturated model. The LRT statistic for comparing a $q$-factor model against the saturated model does follow a $\chi^2$-distribution under mild conditions. However, for nested models with different numbers of factors (the $q$-factor model being the correct one, tested against the one with $q + k$ factors), the LRT statistic is likely not $\chi^2$-distributed due to violation of one or more of the regularity conditions. This is in line with the two basic assumptions required by the asymptotic theory for factor analysis and SEM: the identifiability of the parameter vector and non-singularity of the information matrix (see Shapiro, 1986, and references therein). More specifically, Hayashi et al. (2007) focus on exploratory factor analysis and on the problem that arises when the number of fitted factors exceeds the true number of factors, which can lead to rank deficiency and nonidentifiability of model parameters. That corresponds to violations of the two regularity conditions. Those findings go back to Geweke and Singleton (1980) and Amemiya and Anderson (1990).
More specifically, Geweke and Singleton (1980) studied the behaviour of the LRT in small samples and concluded that, when the regularity conditions of Wilks' theorem are not satisfied, the asymptotic theory seems to be misleading in all sample sizes considered.
To further illustrate the issue with the classical theory for the LRT, we provide three examples. These examples suggest that the $\chi^2$ approximation can perform poorly and give p-values that are either too conservative or too liberal.
Example 1 (Exploratory factor analysis) Consider testing the number of factors in exploratory factor analysis (EFA). For ease of exposition, we consider two hypothesis testing problems, (a) testing a one-factor model against a two-factor model, and (b) testing a one-factor model against a saturated multivariate normal model with an unrestricted covariance matrix. Similar examples have been considered in Hayashi et al. (2007), where similar phenomena have been studied.
1(a). Suppose that we have $J$ mean-centered continuous indicators, $X = (X_1, ..., X_J)^\top$, which follow a $J$-variate normal distribution $N(0, \Sigma)$. The one-factor model parameterizes $\Sigma$ as $\Sigma = a_1 a_1^\top + \Delta$, where $a_1 = (a_{11}, ..., a_{J1})^\top$ contains the loading parameters and $\Delta = \mathrm{diag}(\delta_1, ..., \delta_J)$ is a diagonal matrix with diagonal entries $\delta_1, ..., \delta_J$. Here, $\Delta$ is the covariance matrix for the unique factors. Similarly, the two-factor model parameterizes $\Sigma$ as $\Sigma = a_1 a_1^\top + a_2 a_2^\top + \Delta$, where $a_2 = (a_{12}, ..., a_{J2})^\top$ contains the loading parameters for the second factor and we set $a_{12} = 0$ to ensure model identifiability. Obviously, the one-factor model is nested within the two-factor model. The comparison between these two models is equivalent to testing $H_0: a_2 = 0$ versus $H_a: a_2 \neq 0$.
If Wilks' theorem holds, then under $H_0$ the LRT statistic should asymptotically follow a $\chi^2$-distribution with $J - 1$ degrees of freedom.
We now provide a simulated example. Data are generated from a one-factor model, with $J = 6$ indicators and $N = 5000$ observations. The true parameter values are given in Table 1. We generate 5000 independent datasets. For each dataset, we compute the LRT statistic for comparing the one- and two-factor models. Results are presented in panel (a) of Figure 1. The black solid line shows the empirical cumulative distribution function (CDF) of the LRT statistic, and the red dashed line shows the CDF of the $\chi^2$ distribution suggested by Wilks' theorem. A substantial discrepancy can be observed between the two CDFs. Specifically, the $\chi^2$ CDF tends to stochastically dominate the empirical CDF, implying that p-values based on this $\chi^2$ distribution tend to be more liberal. In fact, if we reject $H_0$ at the 5% significance level based on these p-values, the actual type I error is 10.8%. These results suggest the failure of Wilks' theorem in this example.
[Figure 1, panel (b) caption: The black solid line shows the empirical CDF of the LRT statistic, and the red dashed line shows the CDF of the $\chi^2$ distribution with 9 degrees of freedom as suggested by Wilks' theorem.]

1(b). When testing the one-factor model against the saturated model, the LRT statistic is asymptotically $\chi^2$ if Wilks' theorem holds. The degrees of freedom of the $\chi^2$ distribution are $J(J + 1)/2 - 2J$, where $J(J + 1)/2$ is the number of free parameters in an unrestricted covariance matrix $\Sigma$ and $2J$ is the number of parameters in the one-factor model. In panel (b) of Figure 1, the black solid line shows the empirical CDF of the LRT statistic based on 5000 independent simulations, and the red dashed line shows the CDF of the $\chi^2$-distribution with 9 degrees of freedom. As we can see, the two curves almost overlap with each other, suggesting that Wilks' theorem holds here.
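The degrees of freedom quoted in Examples 1(a) and 1(b) can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function names are hypothetical, and the general $q$-factor count (fixing the upper triangle of the loading matrix to zero) is a standard convention assumed here, of which the note's constraint $a_{12} = 0$ is the $q = 2$ case.

```python
def efa_free_parameters(n_indicators, n_factors):
    """Free parameters of an exploratory factor model (assumed convention:
    q(q-1)/2 loadings are fixed to zero for identifiability)."""
    J, q = n_indicators, n_factors
    loadings = J * q - q * (q - 1) // 2
    unique_variances = J
    return loadings + unique_variances


def saturated_free_parameters(n_indicators):
    """Free parameters of an unrestricted J x J covariance matrix."""
    J = n_indicators
    return J * (J + 1) // 2


J = 6
# Example 1(a): one-factor versus two-factor model, J - 1 degrees of freedom.
df_1a = efa_free_parameters(J, 2) - efa_free_parameters(J, 1)
# Example 1(b): one-factor versus saturated model, J(J+1)/2 - 2J degrees of freedom.
df_1b = saturated_free_parameters(J) - efa_free_parameters(J, 1)
print(df_1a, df_1b)  # prints "5 9"
```

With $J = 6$ this reproduces the 9 degrees of freedom used for the reference distribution in panel (b) of Figure 1.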
Example 2 (Exploratory item factor analysis) We further give an example of exploratory item factor analysis (IFA) for binary data, in which phenomena similar to those in Example 1 are observed. Again, we consider two hypothesis testing problems, (a) testing a one-factor model against a two-factor model, and (b) testing a one-factor model against a saturated multinomial model for a binary random vector.
2(a). The exploratory two-factor IFA model parameterizes the probability $\pi_x$ of each binary response pattern $x$ as an integral over two latent factors, where $\phi(\cdot)$ is the probability density function of a standard normal distribution. This model is also known as a multidimensional two-parameter logistic (M2PL) model (Reckase, 2009), in which the slope parameters are referred to as the loading parameters and the intercept parameters as the easiness parameters. We denote $a_1 = (a_{11}, ..., a_{J1})^\top$ and $a_2 = (a_{12}, ..., a_{J2})^\top$. For model identifiability, we set $a_{12} = 0$. When $a_{j2} = 0$, $j = 2, ..., J$, the two-factor model degenerates to the one-factor model. Similar to Example 1(a), if Wilks' theorem holds, the LRT statistic should asymptotically follow a $\chi^2$-distribution with $J - 1$ degrees of freedom.
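For reference, a common way of writing the two-factor M2PL model is sketched below. The symbols $d_j$ (easiness parameters) and $\xi_1, \xi_2$ (latent factors), as well as the exact layout, are assumptions of this sketch and may differ from the note's original display.

```latex
\pi_x \;=\; P(X = x) \;=\; \int\!\!\int \prod_{j=1}^{J}
\frac{\exp\{x_j (d_j + a_{j1}\xi_1 + a_{j2}\xi_2)\}}
     {1 + \exp(d_j + a_{j1}\xi_1 + a_{j2}\xi_2)}
\,\phi(\xi_1)\,\phi(\xi_2)\, d\xi_1\, d\xi_2 ,
\qquad x \in \{0,1\}^J .
```

Under this form the one-factor model has $2J$ parameters ($J$ easiness parameters and $J$ loadings), and the two-factor model with $a_{12} = 0$ has $3J - 1$, which matches the $J - 1$ degrees of freedom stated above.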
Simulation results suggest the failure of this $\chi^2$ approximation. In Figure 2, we provide plots similar to those in Figure 1, based on 5000 datasets simulated from a one-factor IFA model with sample size $N = 5000$ and $J = 6$. The true parameters of this IFA model are given in Table 2. Panel (a) shows the empirical CDF of the LRT statistic and the CDF of the $\chi^2$ distribution suggested by Wilks' theorem, using a black solid line and a red dashed line, respectively. Similar to Example 1(a), the two CDFs are not close to each other and the $\chi^2$ CDF tends to stochastically dominate the empirical CDF.
2(b). When testing the one-factor IFA model against the saturated model, the LRT statistic is asymptotically $\chi^2$ if Wilks' theorem holds, with $2^J - 1 - 2J$ degrees of freedom. Here, $2^J - 1$ is the number of free parameters in the saturated model, and $2J$ is the number of parameters in the one-factor IFA model. In panel (b) of Figure 2, the black solid line shows the empirical CDF of the LRT statistic for comparing the one-factor model and the saturated model, and the red dashed line shows the CDF of the corresponding $\chi^2$ distribution. Similar to Example 1(b), the two CDFs are very close to each other, suggesting that Wilks' theorem holds here.
Example 3 (Random effects model) Our third example considers a simple random effects model. Consider two-level data with individuals at level 1 nested within groups at level 2. Let $X_{ij}$ be data from the $j$th individual from the $i$th group, where $i = 1, ..., N$ and $j = 1, ..., J$. For simplicity, we assume all the groups have the same number of individuals. Assume the following random effects model, $X_{ij} = \beta_0 + \mu_i + \epsilon_{ij}$, where $\beta_0$ is the overall mean across all the groups, $\mu_i \sim N(0, \sigma_1^2)$ characterizes the difference between the mean for group $i$ and the overall mean, and $\epsilon_{ij} \sim N(0, \sigma_2^2)$ is the individual level residual.
Testing for between-group variability under this model is equivalent to testing $H_0: \sigma_1^2 = 0$ versus $H_a: \sigma_1^2 > 0$. If Wilks' theorem holds, then the LRT statistic should follow a $\chi^2$ distribution with one degree of freedom. We conduct a simulation study and show the results in Figure 3. In this figure, the black solid line shows the empirical CDF of the LRT statistic, based on 5000 independent simulations from the null model with $N = 200$, $J = 20$, $\beta_0 = 0$, and $\sigma_2^2 = 1$. The red dashed line shows the CDF of the $\chi^2$ distribution with one degree of freedom. As we can see, the two CDFs are not close to each other, and the empirical CDF tends to stochastically dominate the theoretical CDF suggested by Wilks' theorem. This suggests the failure of Wilks' theorem in this example.
This kind of phenomenon has been observed when the null model lies on the boundary of the parameter space, in which case the regularity conditions of Wilks' theorem do not hold. The LRT statistic has been shown to often follow a mixture of $\chi^2$-distributions asymptotically (e.g., Shapiro, 1985; Self and Liang, 1987), instead of a $\chi^2$-distribution.
As will be shown in Section 2, such a mixture of $\chi^2$-distributions can be derived from a general theory for the LRT.
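We now explain why Wilks' theorem does not hold in Examples 1(a), 2(a), and 3. We first define some generic notation. Suppose that we have i.i.d. observations $X_1, ..., X_N$ from a parametric model $P_\Theta = \{P_\theta : \theta \in \Theta\}$, where $\Theta \subset \mathbb{R}^k$ and each observation takes values in $\mathbb{R}^J$. We assume that the distributions in $P_\Theta$ are dominated by a common $\sigma$-finite measure $\nu$ with respect to which they have probability density functions $p_\theta : \mathbb{R}^J \to [0, \infty)$. Let $\Theta_0 \subset \Theta$ be a submodel and suppose we are interested in testing $H_0: \theta \in \Theta_0$ versus $H_a: \theta \in \Theta \setminus \Theta_0$. Let $p_{\theta^*}$ be the true model for the observations, where $\theta^* \in \Theta_0$.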
The log-likelihood function is defined as $l_N(\theta) = \sum_{i=1}^N \log p_\theta(X_i)$ and the LRT statistic is defined as $\lambda_N = 2\left(\sup_{\theta \in \Theta} l_N(\theta) - \sup_{\theta \in \Theta_0} l_N(\theta)\right)$. Under suitable regularity conditions, Wilks' theorem states that the LRT statistic $\lambda_N$ is asymptotically $\chi^2$. Wilks' theorem requires several regularity conditions; see e.g., Theorem 12.4.2, Lehmann and Romano (2006). Among these conditions, there are two that the previous examples do not satisfy. First, it is required that $\theta^*$ is an interior point of $\Theta$. This condition is not satisfied for Example 3 when $\Theta$ is taken to be $\{(\beta_0, \sigma_1^2, \sigma_2^2) : \beta_0 \in \mathbb{R}, \sigma_1^2 \in [0, \infty), \sigma_2^2 \in [0, \infty)\}$, as the null model lies on the boundary of the parameter space. Second, it is required that the expected Fisher information matrix at $\theta^*$, $I(\theta^*) = E_{\theta^*}[\nabla l_N(\theta^*) \nabla l_N(\theta^*)^\top]/N$, is strictly positive-definite. As we summarize in Lemma 1, this condition is not satisfied in Examples 1(a) and 2(a), when $\Theta$ is taken to be the parameter space of the corresponding two-factor model. Interestingly, however, when comparing the one-factor model with the saturated model, the Fisher information matrix is strictly positive-definite in Examples 1(b) and 2(b), for both simulated examples.
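To see why the information matrix degenerates in Example 1(a), it may help to differentiate the two-factor covariance structure with respect to a second-factor loading. The following is a sketch of the argument rather than the note's own proof; $e_j$ denotes the $j$th standard basis vector of $\mathbb{R}^J$, a symbol introduced here for illustration.

```latex
\begin{aligned}
\Sigma(\theta) &= a_1 a_1^\top + a_2 a_2^\top + \Delta, \\
\frac{\partial \Sigma}{\partial a_{j2}} &= e_j a_2^\top + a_2 e_j^\top,
\qquad\text{so}\qquad
\left.\frac{\partial \Sigma}{\partial a_{j2}}\right|_{a_2 = 0} = 0, \\
\frac{\partial \log p_\theta(x)}{\partial a_{j2}}
&= \frac{1}{2}\,\mathrm{tr}\!\left[\left(\Sigma^{-1} x x^\top \Sigma^{-1} - \Sigma^{-1}\right)
\frac{\partial \Sigma}{\partial a_{j2}}\right] = 0
\quad\text{at } a_2 = 0, \text{ for every } x .
\end{aligned}
```

Since the score components for $a_{22}, ..., a_{J2}$ vanish identically at any point with $a_2 = 0$, the corresponding rows and columns of $I(\theta^*)$ are zero, and the matrix cannot be invertible there.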
(2) For the two-factor IFA model given in Example 2(a), choose the parameter space
We remark on the consequences of having a non-invertible information matrix. The first consequence is computational. If the information matrix is non-invertible, then the negative log-likelihood function tends not to be strongly convex near the MLE, resulting in slow convergence of numerical optimizers. In the context of Examples 1(a) and 2(a), this means that computing the MLE for the corresponding two-factor models may suffer from convergence issues. When a convergence issue occurs, the obtained LRT statistic is below its actual value, because the log-likelihood for the two-factor model does not attain its maximum. Consequently, the p-value tends to be larger than its actual value, and thus the decision based on the p-value tends to be more conservative than it would be without the convergence issue. This convergence issue is observed when conducting simulations for these examples. To improve the convergence, we use multiple random starting points when computing MLEs. The second consequence is a poor asymptotic convergence rate for the MLE. That is, the convergence rate is typically much slower than the standard parametric rate $O_p(1/\sqrt{N})$, even though the MLE is still consistent; see Rotnitzky et al. (2000) for more theoretical results on this topic.
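The multi-start strategy mentioned above can be sketched on a toy problem. The bimodal function `toy_loglik`, the plain gradient ascent, and the range of starting values are all hypothetical stand-ins chosen for illustration; they are not the factor-model likelihood or the optimizer used for the simulations in this note.

```python
import math
import random


def toy_loglik(theta):
    """Hypothetical bimodal log-likelihood: a local maximum near -1 and the
    global maximum near 2. This is NOT the factor-model likelihood."""
    local_part = 0.3 * math.exp(-2.0 * (theta + 1.0) ** 2)
    global_part = math.exp(-2.0 * (theta - 2.0) ** 2)
    return math.log(local_part + global_part)


def gradient_ascent(start, step=0.05, n_iter=500, h=1e-5):
    """Plain gradient ascent with a central-difference derivative."""
    theta = start
    for _ in range(n_iter):
        grad = (toy_loglik(theta + h) - toy_loglik(theta - h)) / (2.0 * h)
        theta += step * grad
    return theta, toy_loglik(theta)


def multi_start(n_starts, seed=0, low=-4.0, high=5.0):
    """Run the optimizer from several random starts and keep the best fit."""
    rng = random.Random(seed)
    fits = [gradient_ascent(rng.uniform(low, high)) for _ in range(n_starts)]
    return max(fits, key=lambda fit: fit[1])


_, single = gradient_ascent(-3.0)   # one unlucky start ends at the local maximum
_, best = multi_start(20)           # best maximized value over 20 random starts
# If the larger model's fit is stuck at the local maximum, the LRT statistic
# is understated by twice the gap in maximized log-likelihood.
shortfall = 2.0 * (best - single)
print(single, best, shortfall)
```

In this toy setting the single unlucky start understates the maximized log-likelihood, which mirrors the mechanism described above: an under-maximized fit of the larger model lowers the LRT statistic and inflates the p-value.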
We further provide some remarks on the LRT in Examples 1(b) and 2(b), which use an LRT for comparing the fitted model with the saturated model. Although Wilks' theorem holds asymptotically in Example 2(b), the $\chi^2$ approximation may not always work as well as in our simulated example. This is because, when the number of items becomes larger and the sample size is not large enough, the contingency  (2019) and references therein). Yuan et al. (2015) found that under large $J$ and small $N$ the test over-rejects the fitted model, with a rejection probability much higher than the nominal one. They proposed an empirical approach based on the Bartlett correction to correct the mean of the test statistic so that it is better approximated by a $\chi^2$ distribution with a large $J$ and/or small $N$. Moshagen (2012) also concluded that the LRT for overall goodness-of-fit depends on the size of the covariance matrix (the number of observed variables, $J$) even when all other assumptions are met (normality, no misspecification). That paper also gives many references to other studies that found that the LRT statistic is largely inflated when there are violations of normality (see Satorra and Bentler (2001) for the LRT statistic under non-normality).

General Theory for Likelihood Ratio Test
The previous discussions suggest that Wilks' theorem does not hold for Examples 1(a), 2(a), and 3, due to the violation of regularity conditions. It is then natural to ask: what asymptotic distribution does $\lambda_N$ follow in these situations? Is there asymptotic theory characterizing such irregular situations? The answer to both questions is yes. In fact, a general theory characterizing these less regular situations has already been established in Chernoff (1954). In what follows, we provide a version of this general theory that is proven in van der Vaart (2000), Theorem 16.7. It is also given in Drton (2009), Theorem 2.6.
We first introduce some notation. We use $\mathbb{R}^{J \times J}_{pd}$ and $\mathbb{R}^{J \times J}_{d}$ to denote the spaces of $J \times J$ strictly positive definite matrices and diagonal matrices, respectively. In addition, we define a one-to-one mapping $\rho: \mathbb{R}^{J \times J}_{pd} \to \mathbb{R}^{J(J+1)/2}$ that maps a positive definite matrix to a vector containing all its upper triangular entries (including the diagonal entries). That is, $\rho(\Sigma) = (\sigma_{11}, \sigma_{12}, ..., \sigma_{1J}, \sigma_{22}, ..., \sigma_{2J}, ..., \sigma_{JJ})^\top$, for $\Sigma = (\sigma_{ij})_{J \times J} \in \mathbb{R}^{J \times J}_{pd}$. We also define a one-to-one mapping $\mu: \mathbb{R}^{J \times J}_{d} \to \mathbb{R}^J$ that maps a diagonal matrix to a vector containing all its diagonal entries. For comparing nested models $\Theta_0 \subset \Theta \subset \mathbb{R}^k$, with $\theta^* \in \Theta_0$ being the true parameter vector, regularity conditions C1-C5 are needed.
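C1. The true parameter vector $\theta^*$ is in the interior of $\Theta$.
C2. There exists a measurable map $\dot{l}_\theta : \mathbb{R}^J \to \mathbb{R}^k$ such that $\lim_{h \to 0} \frac{1}{\|h\|^2} \int \left( \sqrt{p_{\theta^* + h}(x)} - \sqrt{p_{\theta^*}(x)} - \frac{1}{2} h^\top \dot{l}_{\theta^*}(x) \sqrt{p_{\theta^*}(x)} \right)^2 d\nu(x) = 0 \qquad (1)$, and the Fisher information matrix $I(\theta^*)$ for $P_\Theta$ is invertible.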
The asymptotic distribution of $\lambda_N$ depends on the local geometry of the parameter space $\Theta_0$ at $\theta^*$. This is characterized by the tangent cone $T_{\Theta_0}(\theta^*)$, to be defined below.
Definition 1 The tangent cone $T_{\Theta_0}(\theta^*)$ of the set $\Theta_0 \subset \mathbb{R}^k$ at the point $\theta^* \in \mathbb{R}^k$ is the set of vectors in $\mathbb{R}^k$ that are limits of sequences $\alpha_n(\theta_n - \theta^*)$, where $\alpha_n$ are positive reals and $\theta_n \in \Theta_0$ converge to $\theta^*$.
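A one-dimensional illustration of Definition 1 may be useful; the set $[0, \infty)$ is chosen here purely as an example and mirrors the constraint on a variance parameter.

```latex
\Theta_0 = [0, \infty): \qquad
T_{\Theta_0}(0) = [0, \infty), \qquad
T_{\Theta_0}(\theta^*) = \mathbb{R} \quad \text{for any } \theta^* > 0 .
```

At the boundary point $0$, every difference $\theta_n - 0$ is non-negative, so only non-negative limits arise, and any $\tau \geq 0$ is attained by taking $\theta_n = \tau/n$ and $\alpha_n = n$. At an interior point, $\theta_n$ can approach $\theta^*$ from either side, so the tangent cone is the whole real line, which is a linear subspace.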
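The following regularity condition on the tangent cone $T_{\Theta_0}(\theta^*)$, known as Chernoff regularity, is also required.
C5. For every vector $\tau$ in the tangent cone $T_{\Theta_0}(\theta^*)$ there exist $\epsilon > 0$ and a map $\alpha: [0, \epsilon) \to \Theta_0$ with $\alpha(0) = \theta^*$ such that $\tau = \lim_{t \to 0+} [\alpha(t) - \alpha(0)]/t$.
Under the above regularity conditions, Theorem 1 below holds and explains the phenomena in Examples 1(b) and 2(b).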
Theorem 1 Suppose that conditions C1-C5 are satisfied for comparing nested models $\Theta_0 \subset \Theta \subset \mathbb{R}^k$, with $\theta^* \in \Theta_0$ being the true parameter vector. Then as $N$ grows to infinity, the likelihood ratio statistic $\lambda_N$ converges to the distribution of $\min_{\tau \in T_{\Theta_0}(\theta^*)} \| Z - I(\theta^*)^{1/2} \tau \|^2 \qquad (2)$, where $Z = (Z_1, ..., Z_k)^\top$ is a random vector consisting of i.i.d. standard normal random variables.
Among the conditions, (1) in C2 is also known as the condition of "differentiability in quadratic mean" for $P_\Theta$ at $\theta^*$. As will be shown below, C1 can be satisfied with a suitable choice of the parameter space. C2 and C3 impose certain regularities on $p_\theta(x)$ as a bivariate function of $\theta \in \Theta$ and $x \in \mathbb{R}^J$ that hold for all our examples below. C4 holds for our examples by Theorem 10.1.6, Casella and Berger (2002). C5 requires certain regularity on the local geometry of $T_{\Theta_0}(\theta^*)$, which also holds for our examples below.
Remark 2 By Theorem 1, the asymptotic distribution of $\lambda_N$ depends on the tangent cone $T_{\Theta_0}(\theta^*)$. If $T_{\Theta_0}(\theta^*)$ is a linear subspace of $\mathbb{R}^k$ with dimension $k_0$, then one can easily show that the asymptotic reference distribution of $\lambda_N$ is $\chi^2$ with $k - k_0$ degrees of freedom. As we explain below, Theorem 1 directly applies to Examples 1(b) and 2(b).
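The "one can easily show" step in Remark 2 can be sketched as follows, assuming the limit in Theorem 1 has the projection form $\min_{\tau} \| Z - I(\theta^*)^{1/2} \tau \|^2$ as in van der Vaart (2000), Theorem 16.7. The symbols $L$ and $P_L$ are introduced here only for illustration.

```latex
\begin{aligned}
L &= \{ I(\theta^*)^{1/2} \tau : \tau \in T_{\Theta_0}(\theta^*) \},
\qquad \dim L = k_0 \ \ (\text{because } I(\theta^*) \text{ is invertible}), \\
\min_{\tau \in T_{\Theta_0}(\theta^*)} \| Z - I(\theta^*)^{1/2} \tau \|^2
&= \| (I_k - P_L) Z \|^2
= Z^\top (I_k - P_L) Z \;\sim\; \chi^2_{k - k_0} .
\end{aligned}
```

Here $P_L$ is the orthogonal projection onto $L$ and $I_k$ is the $k \times k$ identity matrix. The last step uses the fact that $I_k - P_L$ is a symmetric idempotent matrix of rank $k - k_0$ and $Z$ is standard normal.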
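In Example 1(b), as the saturated model is a $J$-variate normal distribution with an unrestricted covariance matrix, its parameter space can be chosen as $\Theta = \rho(\mathbb{R}^{J \times J}_{pd}) \subset \mathbb{R}^{J(J+1)/2}$, and the parameter space for the restricted model is $\Theta_0 = \{\rho(a_1 a_1^\top + \Delta) : a_1 \in \mathbb{R}^J, \Delta \in \mathbb{R}^{J \times J}_{pd} \cap \mathbb{R}^{J \times J}_{d}\}$. It is easy to see that C1 holds with the current choice of $\Theta$. The tangent cone takes the form $T_{\Theta_0}(\theta^*) = \{\rho(a_1^* b_1^\top + b_1 a_1^{*\top} + B) : b_1 \in \mathbb{R}^J, B \in \mathbb{R}^{J \times J}_{d}\}$, which is a linear subspace of $\mathbb{R}^{J(J+1)/2}$ with dimension $2J$, as long as $a^*_{j1} \neq 0$, $j = 1, ..., J$. By Remark 2, $\lambda_N$ is then asymptotically $\chi^2$ with $J(J+1)/2 - 2J$ degrees of freedom.
Theorem 1 is not applicable to Example 3, because $\theta^*$ is on the boundary of $\Theta$ if $\Theta$ is chosen to be $\{(\beta_0, \sigma_1^2, \sigma_2^2) : \beta_0 \in \mathbb{R}, \sigma_1^2 \in [0, \infty), \sigma_2^2 \in [0, \infty)\}$, and thus C1 is violated. Theorem 1 is also not applicable to Examples 1(a) and 2(a), because the Fisher information matrix is not invertible when $\Theta$ is chosen to be the parameter space of the two-factor EFA and IFA models, respectively, in which case condition C2 is violated.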
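To derive the asymptotic theory for such problems, we view them as problems of testing nested submodels under a saturated model for which $\theta^*$ is an interior point of $\Theta$ and the information matrix is invertible. Consider testing $H_0: \theta \in \Theta_0$ versus $H_a: \theta \in \Theta_1 \setminus \Theta_0 \qquad (3)$, where $\Theta_0 \subset \Theta_1$ are two nested submodels of a saturated model $\Theta$. Under this formulation, Theorem 2 below provides the asymptotic theory for the LRT statistic $\lambda_N = 2\left(\sup_{\theta \in \Theta_1} l_N(\theta) - \sup_{\theta \in \Theta_0} l_N(\theta)\right)$.
Theorem 2 Suppose that conditions C1-C5 are satisfied for $\Theta_0$ and $\Theta_1$, where $\Theta_0 \subset \Theta_1 \subset \Theta \subset \mathbb{R}^k$. Further let $\theta^* \in \Theta_0$ be the true parameter vector. As $N$ grows to infinity, the likelihood ratio statistic $\lambda_N$ converges to the distribution of $\min_{\tau \in T_{\Theta_0}(\theta^*)} \| Z - I(\theta^*)^{1/2} \tau \|^2 - \min_{\tau \in T_{\Theta_1}(\theta^*)} \| Z - I(\theta^*)^{1/2} \tau \|^2 \qquad (4)$, where $Z = (Z_1, ..., Z_k)^\top$ is a random vector consisting of i.i.d. standard normal random variables.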
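Example 6 (Random effects model, revisited) Now we consider Example 3. Let $1_n$ denote a length-$n$ vector whose entries are all 1, and $I_n$ denote the $n \times n$ identity matrix. As $X_i = (X_{i1}, ..., X_{iJ})^\top$ from the random effects model is multivariate normal with mean $\beta_0 1_J$ and covariance matrix $\sigma_1^2 1_J 1_J^\top + \sigma_2^2 I_J$, the saturated parameter space can be taken as $\Theta = \{(\beta_0, \rho(\Sigma)^\top)^\top : \beta_0 \in \mathbb{R}, \Sigma \in \mathbb{R}^{J \times J}_{pd}\}$. The parameter spaces for the restricted models are $\Theta_0 = \{(\beta_0, \rho(\Sigma)^\top)^\top : \beta_0 \in \mathbb{R}, \Sigma = \sigma_2^2 I_J, \sigma_2^2 > 0\}$ and $\Theta_1 = \{(\beta_0, \rho(\Sigma)^\top)^\top : \beta_0 \in \mathbb{R}, \Sigma = \sigma_1^2 1_J 1_J^\top + \sigma_2^2 I_J, \sigma_1^2 \geq 0, \sigma_2^2 > 0\}$. Then, C1 holds. The tangent cones for $\Theta_0$ and $\Theta_1$ are $T_{\Theta_0}(\theta^*) = \{(b_0, \rho(\Sigma)^\top)^\top : b_0 \in \mathbb{R}, \Sigma = b_2 I_J, b_2 \in \mathbb{R}\}$ and $T_{\Theta_1}(\theta^*) = \{(b_0, \rho(\Sigma)^\top)^\top : b_0 \in \mathbb{R}, \Sigma = b_1 1_J 1_J^\top + b_2 I_J, b_1 \geq 0, b_2 \in \mathbb{R}\}$. By Theorem 2, $\lambda_N$ converges to the distribution of (4).
In this example, the form of (4) can be simplified, thanks to the forms of $T_{\Theta_0}(\theta^*)$ and $T_{\Theta_1}(\theta^*)$. We denote $c_0 = (1, 0, ..., 0)^\top$, $c_1 = (0, \rho(1_J 1_J^\top)^\top)^\top$, and $c_2 = (0, \rho(I_J)^\top)^\top$. It can be seen that $T_{\Theta_0}(\theta^*)$ is a 2-dimensional linear subspace spanned by $\{c_0, c_2\}$, and $T_{\Theta_1}(\theta^*) = \{\alpha_0 c_0 + \alpha_1 c_1 + \alpha_2 c_2 : \alpha_0, \alpha_2 \in \mathbb{R}, \alpha_1 \geq 0\}$ is a half-space of a 3-dimensional linear subspace. Let $P$ denote the orthogonal projection onto the span of $I(\theta^*)^{1/2} c_0$ and $I(\theta^*)^{1/2} c_2$, and let $v = (I_k - P) I(\theta^*)^{1/2} c_1 / \|(I_k - P) I(\theta^*)^{1/2} c_1\|$; then (4) has the form $(v^\top Z)^2 1_{\{v^\top Z \geq 0\}}$.
It is easy to see that $v^\top Z$ follows a standard normal distribution. Therefore, $\lambda_N$ converges to the distribution of $w^2 1_{\{w \geq 0\}}$, where $w$ is a standard normal random variable. This is known as a mixture of $\chi^2$-distributions: it places probability $1/2$ on zero and otherwise follows a $\chi^2$-distribution with one degree of freedom. The blue dotted line in Figure 3 shows the CDF of this mixture $\chi^2$-distribution. This CDF is very close to the empirical CDF of the LRT statistic, confirming our asymptotic theory.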
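The limiting distribution $w^2 1_{\{w \geq 0\}}$ can be checked by a small simulation of Example 3. The sketch below makes two assumptions that are not taken from the note: it uses the closed-form MLE that is available for the balanced design, and it draws the within- and between-group sums of squares directly from their exact null $\chi^2$ distributions instead of simulating raw data. The function names are hypothetical.

```python
import math
import random


def lrt_from_sums(ssw, ssb, n_groups, group_size):
    """LRT statistic for H0: sigma_1^2 = 0 in the balanced random effects
    model, computed from within (ssw) and between (ssb) sums of squares."""
    N, J = n_groups, group_size
    s_w = ssw / (N * (J - 1))    # unconstrained MLE of sigma_2^2
    s_b = ssb / N                # unconstrained MLE of sigma_2^2 + J * sigma_1^2
    if s_b <= s_w:               # MLE of sigma_1^2 sits on the boundary 0
        return 0.0
    s_0 = (ssw + ssb) / (N * J)  # MLE of sigma_2^2 under H0
    return N * J * math.log(s_0) - N * (J - 1) * math.log(s_w) - N * math.log(s_b)


def simulate_lrt(n_reps, n_groups=200, group_size=20, seed=0):
    """Draw LRT statistics under H0 with sigma_2^2 = 1 (assumed shortcut:
    ssw ~ chi2 with N(J-1) df and ssb ~ chi2 with N-1 df, independent)."""
    rng = random.Random(seed)
    N, J = n_groups, group_size
    stats = []
    for _ in range(n_reps):
        ssw = rng.gammavariate(N * (J - 1) / 2.0, 2.0)
        ssb = rng.gammavariate((N - 1) / 2.0, 2.0)
        stats.append(lrt_from_sums(ssw, ssb, N, J))
    return stats


def chi2_1_pvalue(x):
    """P(chi2_1 >= x), the p-value suggested by Wilks' theorem."""
    return math.erfc(math.sqrt(x / 2.0))


def mixture_pvalue(x):
    """P(w^2 1{w >= 0} >= x) for standard normal w."""
    return 1.0 if x <= 0 else 0.5 * chi2_1_pvalue(x)


stats = simulate_lrt(5000)
frac_zero = sum(s == 0.0 for s in stats) / len(stats)
naive_rate = sum(chi2_1_pvalue(s) < 0.05 for s in stats) / len(stats)
mixture_rate = sum(mixture_pvalue(s) < 0.05 for s in stats) / len(stats)
print(frac_zero, naive_rate, mixture_rate)
```

Roughly half of the simulated statistics should be exactly zero. The rejection rate based on the $\chi^2$ reference with one degree of freedom should be close to half the nominal 5% level, while the rate based on the mixture should be close to 5%.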
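The tangent cone of $\Theta_1$ at $\theta^*$ becomes $T_{\Theta_1}(\theta^*) = \{\rho(a_1^* b_1^\top + b_1 a_1^{*\top} + b_2 b_2^\top + B) : b_1 \in \mathbb{R}^J, b_2 = (0, b_{22}, ..., b_{J2})^\top \in \mathbb{R}^J, B \in \mathbb{R}^{J \times J}_{d}\}$. Note that $T_{\Theta_1}(\theta^*)$ is not a linear subspace, due to the $b_2 b_2^\top$ term. Therefore, by Theorem 2, the asymptotic distribution of $\lambda_N$ is not $\chi^2$. See the blue dotted line in Panel (a) of Figure 1 for the CDF of this asymptotic distribution. This CDF almost overlaps with the empirical CDF of the LRT statistic, suggesting that Theorem 2 holds here.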
Example 8 (Exploratory item factor analysis, revisited) Now we consider Example 2(a). Let $\Theta$, $\Theta_0$, $\theta^*$ and $T_{\Theta_0}(\theta^*)$ be the same as those in Example 5. Let $\Theta_1$ be the parameter space for the two-factor model. Recall $f_x$ and $g_x$ as defined in Example 5. For any $x \in \Gamma_J$, we further define a matrix $H_x = (h_{rs}(x))_{J \times J}$, which enters the expression for the tangent cone of $\Theta_1$ at $\theta^*$. Similar to Example 7, $T_{\Theta_1}(\theta^*)$ is not a linear subspace. Therefore, by Theorem 2, $\lambda_N$ is not asymptotically $\chi^2$. In Panel (a) of Figure 2, the asymptotic CDF suggested by

Discussion
In this note, we point out how the regularity conditions of Wilks' theorem may be violated, using three examples of models with latent variables. In these cases, the asymptotic distribution of the LRT statistic is no longer $\chi^2$ and therefore the test may no longer be valid. It seems that the regularity conditions of Wilks' theorem, especially the requirement of a non-singular Fisher information matrix, have not received enough attention. As a result, the LRT is often misused. Although we focus on the LRT, it is worth pointing out that other testing procedures, including the Wald and score tests, as well as limited-information tests (e.g., tests based on bivariate information), require similar regularity conditions and thus may also be affected.
We present a general theory for the LRT first established in Chernoff (1954).
Besides the singularity and boundary issues, the asymptotic distribution may be inaccurate when the dimension of the parameter space is relatively high compared with the sample size. This problem has been intensively studied in statistics, and a famous result is the Bartlett correction, which provides a way to improve the $\chi^2$ approximation (Bartlett, 1937; Bickel and Ghosh, 1990; Cordeiro, 1983; Box, 1949; Lawley, 1956; Wald, 1943). When the regularity conditions do not hold, the form of the Bartlett correction will be different. A general form of the Bartlett correction remains to be developed, which is left for future investigation.
Proof of Theorem 2. The proof is similar to that of Theorem 16.7, van der Vaart (2000). We only state the main steps and skip the details which readers can find in van der Vaart (2000).