The effect of latent and error non-normality on corrections to the test statistic in structural equation modeling

In structural equation modeling, several corrections to the likelihood-ratio model test statistic have been developed to counter the effects of non-normal data. Previous robustness studies investigating the performance of these corrections typically induced non-normality in the indicator variables. However, non-normality in the indicators can originate from non-normal errors or non-normal latent factors. We conducted a Monte Carlo simulation to analyze the effect of non-normality in factors and errors on six different test statistics based on maximum likelihood estimation by evaluating the effect on empirical rejection rates and derived indices (RMSEA and CFI) for different degrees of non-normality and sample sizes. We considered the uncorrected likelihood-ratio model test statistic and the Satorra–Bentler scaled test statistic with Bartlett correction, as well as the mean and variance adjusted test statistic, a scale-shifted approach, a third moment-adjusted test statistic, and an approach drawing inferences from the relevant asymptotic chi-square mixture distribution. The results indicate that the values of the uncorrected test statistic—compared to values under normality—are associated with a severely inflated type I error rate when latent variables are non-normal, but virtually no differences occur when errors are non-normal. Although no general pattern regarding the source of non-normality for all analyzed measures of fit can be derived, the Satorra–Bentler scaled test statistic with Bartlett correction performed satisfactorily across conditions.

A crucial issue in structural equation modeling (SEM)-as in any statistical modeling technique-is the reliable evaluation of model fit to assess how well a particular model describes the data. In the context of SEM, the likelihoodratio model test (LRT) statistic based on maximum likelihood (ML) estimation comparing the fit of the investigated model against the saturated model is the most widely used (Savalei & Kolenikov, 2008). The LRT statistic is derived based on the assumption that the observed variables follow a multivariate normal distribution. In case of non-normality, however, the type I error rates of the LRT statistic are-sometimes grossly-inflated (e.g., Curran et al., 1996;Maydeu-Olivares, 2017;Nevitt & Hancock, 2004). Given that the assumption of normally distributed data rarely holds in substantive research (e.g., Blanca et al., 2013;Cain et al., 2017;Micceri, 1989), several corrections to the LRT statistic have been developed aiming at modifying the test statistic to more closely follow the asymptotic chi-square distribution under conditions of non-normality (e.g., Asparouhov & Muthén, 2010;Lin & Bentler, 2012;Satorra & Bentler, 1994).
Nevertheless, as the abovementioned studies focus on manipulating the distribution of the observed indicator variables (often relying on the approach by Vale and Maurelli, 1983), they omit an important aspect: The genuine factor analytic structure of SEM defines the indicator variables as the sum of the weighted latent factors and error terms X = Λξ + ε, with X containing the indicator variables, Λ collecting the loadings of the latent factors ξ and ε containing the error terms. Correspondingly, non-normality in the indicator variables can originate from non-normally distributed latent factors or from non-normally distributed errors (Auerswald & Moshagen, 2015). This distinction between non-normal latent factors and errors has been addressed by asymptotic robustness theory (Amemiya & Anderson, 1990;Browne, 1987;Browne & Shapiro, 1988;Mooijaart & Bentler, 1991;Shapiro, 1987), which specifies conditions under which normal theory test statistics asymptotically follow a chi-square distribution if the sample size N → ∞, even if the normality assumption is violated. For example, Amemiya and Anderson (1990) considered the following model: For 1 ≤ i ≤ N, x i is the observable p × 1 random vector, μ is a p × 1 parameter vector, Λ is a p × h loading matrix, f i is an h × 1 unobservable factor vector, and u i is a p × 1 unobservable error vector. In this case, normal theory test statistics are asymptotically chi-square distributed if the f i are independently and identically distributed (i.i.d.) with any distribution with finite variance, if the u i are i.i.d. with any distribution with finite variance, and the p components of u i are independent.
However, it is important to note that asymptotic robustness theory only guarantees that the test statistics follow a chi-square distribution asymptotically, so that the actual distribution in finite (and realistic) samples might diverge substantially. Whereas few simulation studies have investigated the effect of the underlying multivariate distribution on the manifest variables by generating data based on non-normal latent factors (e.g., Molenaar et al., 2010;Schmitt et al., 2006), these did not systematically compare the effect of non-normal latent factors versus non-normal errors. An exception is a (1) small simulation study by Auerswald and Moshagen (2015), where data were generated based on non-normally distributed latent factors as well as on non-normally distributed errors, respectively. This study provided evidence that the source of non-normality has an important effect on the uncorrected as well as on the Satorra-Bentler scaled test statistic in finite samples. Specifically, they found that the type I error rates of these statistics are inflated in the case of non-normal latent variables but not in the case of non-normal errors. However, these sources of non-normality were commonly confounded in previous simulation studies. To gain a more profound understanding of how the multivariate distribution (i.e., the distribution of latent factors and errors) affects corrections to the test statistics in finite samples, we thus extended the study of Auerswald and Moshagen (2015), which was limited by investigating the behavior for one sample size (N = 500) and by considering only one degree of non-normality. The present study relies on more comprehensive analyses including several test statistics correcting not only the mean but higher-order moments (see below for details) and investigates the effects of different extents of non-normality in sample sizes that are commonly encountered in substantive research.
The present study thus aims to answer the question to which extent the results from previous robustness studies are valid if the source of non-normality is considered. To this end, we relied on the NOTAMO (NOrmal To Arbitrary MOments) algorithm (Auerswald, 2017), which can be used to generate marginal distributions (i.e., the distribution of the indicator variables) sharing prespecified central moments that nevertheless differ in their multivariate distributions. NOTAMO induces non-normality in latent factors or errors, so the source of non-normality can be manipulated. We investigated the effect of the source of non-normality on the uncorrected LRT statistic based on normal theory ML estimation as well as on several corrections adjusting different central moments. This is of particular interest given that NOTAMO allows one to manipulate the source of non-normality while holding the central moments of the marginal distributions constant. Beyond considering moment-corrected test statistics, we also investigated an approach that directly estimates the underlying limiting chi-square mixture distribution to draw inferences.
The test statistic to evaluate the overall model fit in SEM depends on the sample estimate of the minimum of the fit function F = F S, ̂ , where the parameter estimates ̂ are determined in such a way that they minimize the discrepancy between the model-implied variance-covariance matrix ̂ and the empirical variance-covariance matrix S (for details, see Bollen, 1989). ML estimates can be obtained based on the weighted least squares (WLS) fit function where s and ̂ represent a vector with the unique elements of S and ̂ , respectively, and W denotes a weight matrix. When the unique elements of ̂ are used as weights, the WLS estimates are asymptotically equivalent to the estimates obtained based on ML estimation given by with p indicating the number of manifest variables (for details see Bollen, 1989;Browne, 1974). That means that although the fit functions in Eqs. 2 and 3 differ, both can be used to obtain ML estimates. Note that both fit functions refer to models without mean structure (see e.g., Hayashi et al., 2007 for details regarding fit functions for mean and covariance structures).
Given the validity of a set of assumptions, the asymptotic distribution of the (Wishart) LRT statistic T ML =F ML (N − 1) under the null hypothesis (i.e., the population variancecovariance matrix equals the model-implied variancecovariance matrix) follows a chi-square distribution with df = p * − q degrees of freedom with p * = p(p+1) 2 and q as the number of free model parameters (for details, see Bollen, 1989). If the asymptotic robustness condition holds, T ML also follows the same chi-square distribution as N → ∞. More generally, the test statistic can be shown to follow a weighted mixture distribution of independent chi-square variables with 1 degree of freedom each where w j denotes the weights (Satorra & Bentler, 1994). The weights are the non-null eigenvalues of UΓ, where U is the residual weight matrix defined as with = ( ) = � denoting the p * × q Jacobian matrix and Γ referring to the asymptotic variance-covariance matrix of the distribution of where σ 0 is a vector with the unique elements of the population variance-covariance matrix, Σ 0 (Browne, 1984;Satorra & Bentler, 1994).
As is immediately evident from Eq. 4, the actual distribution of T ML can only be appropriately described by an unweighted chi-square distribution when all weights are equal to one. If the weights disperse around one, as happens, for instance, when the normality assumption is violated (Brosseau-Liard et al., 2012;Satorra & Bentler, 1994), the test statistic follows a chi-square weighted mixture distribution. In such cases, drawing inferences from an unweighted reference chi-square distribution leads to incorrect conclusions.
Based on this observation, the core idea of many corrected test statistics is to rely on the (unweighted) chi-square reference distribution to draw inferences but to adjust T ML by the estimated weights, so that certain moments of the distribution are asymptotically equal to the respective moments of the unweighted reference chi-square distribution. The Satorra-Bentler scaled chi-square test statistic T M adjusts the mean of the test statistic leading to an approximate chi-square distribution with asymptotically correct mean (i.e., the degrees of freedom of the test statistic): with the scaling factor c = tr(Û̂ ) df and tr Û̂ as the expected value of the asymptotic distribution of the test statistic (Satorra & Bentler, 1994). As issues regarding the performance of T M in small samples have been reported in the literature (e.g., Nevitt & Hancock, 2004;Savalei, 2010), we applied the Bartlett (1950) correction to T M leading to T MB given by where h represents the number of latent factors. Rather than just adjusting the mean, the Satorra-Bentler adjusted chi-square test statistic T MV1 given by results in an approximate chi-square distribution of the test statistic with d = tr(Û̂ ) 2 tr (Û̂ ) 2 degrees of freedom and asymptotically correct mean and variance (Satorra & Bentler, 1994). A related correction scales and shifts the underlying distribution (Asparouhov & Muthén, 2010). This correction leads to a test statistic with df degrees of freedom and asymptotically correct mean (i.e., df) and variance (i.e., 2 df). The corrected test statistic is defined as Beyond correcting the mean and variance, the third moment adjusted test statistic T MS (Lin & Bentler, 2012) adjusts the mean and the skewness of the test statistic via The corrected test statistic T MS asymptotically follows a chi-square distribution with v degrees of freedom and shares its mean and skewness values with the unweighted reference chi-square distribution. Instead of correcting T ML by adapting particular standardized moments so that it more closely follows the expected unweighted chi-square reference distribution, an alternative procedure is to rely on the uncorrected T ML , but to draw inferences from the proper asymptotic weighted mixture distribution as defined in Eq. 4. The weighted chi-square mixture distribution can be estimated using the (non-null) eigenvalues of Û̂ as weights w j . The resulting weighted mixture distribution has an expected value of tr Û̂ and a variance of tr Û̂ 2 (for details, see Satorra & Bentler, 1994). However, by constructing the mixture distribution using all weights, the resulting distribution should approximate the actual distribution of T ML concerning all higher order moments, rather than just the mean, variance, and/or skewness. Throughout this paper, we refer to this test statistic as T mix . Note that T mix has the same chi-square value as T ML ; however, the p-value of the former might differ from that of the latter because of the different reference distributions involved.
Finally, we also considered derived fit indices, i.e., the root mean square error of approximation (RMSEA; Steiger, 2016;Steiger & Lind, 1980) and the comparative fit index (CFI; Bentler, 1990). To maintain comparability with previous robustness studies, we additionally simulated control conditions directly manipulating the distribution of the indicator variables by means of the Vale-Maurelli (VM; Vale & Maurelli, 1983) approach for non-normal data and by means of eigendecomposition for multivariate normal data.
Based on previous findings, we expected the uncorrected test statistic to perform best under conditions of multivariate normality and to observe inflated type I error rates with an increasing degree of non-normality, in particular when nonnormality arises from non-normal latent variables. Whereas the corrections under scrutiny are expected to recover the true population value more closely regardless of the degree of nonnormality (e.g., Curran et al., 1996;Tong & Bentler, 2013), we also expect an effect of the source of non-normality on these outcomes as indicated by Auerswald and Moshagen (2015).

Methods
We created various experimental conditions to assess the effect of the source of non-normality on different test statistics by considering three non-normality conditions (latent, where non-normality in indicator variables originated from non-normal latent factors; error, where non-normality in indicator variables originated from non-normal errors; and marginal, where non-normality was directly induced in the indicator variables), four sample sizes (N = 200, 400, 600, and 1000), six test statistics (one uncorrected, four with corrected moments, and one estimating the limiting weighted mixture distribution), three degrees of kurtosis (k = 3, 10, and 17), and two specification statuses of the model (correct versus incorrect). We considered different measures of model fit, namely the rejection rates of the LRT statistic as well as the RMSEA and the CFI, as both depend on the LRT statistic and are thus affected by the analyzed corrections. Data generation and analysis were performed with the opensource software R (R Core Team, 2020) using the package lavaan (version: 0.6-6; Rosseel, 2012) for model estimation and the package distrEx (version 2.8.0; Ruckdeschel et al., 2019) to estimate the weighted mixture distribution.

Population and analysis models
We defined a factor analytic model in the population with three latent factors and p = 15 indicator variables. Data generation was based on the variance-covariance matrix in the population Σ 0 given by Σ 0 = ΛΦΛ ′ + Θ, where Λ′ represents the transposed matrix of loadings and Φ is the variance-covariance matrix of the latent factors. The elements of the diagonal residual matrix Θ were defined such that the respective squared loadings of Λ summed up with the respective residual term to one. We defined a correlated three-factor model with six nonzero secondary loadings and a variance-covariance matrix between the factors of In conditions involving correctly specified models, all secondary loadings were freely estimated, whereas a confirmatory factor analysis model with three factors and no secondary loadings was estimated in conditions considering misspecified models. The conditions involving misspecifications were associated with a population minimum of the fit function of F 0 = 0.328. Table 1 further shows the expected power (Jobst et al., in press;Moshagen & Erdfelder, 2016) of the LRT statistic as well as the population values of descriptive indices of model fit.

Data generation
Based on the population model described above, we drew 1000 random samples each (only valid solutions that converged were considered) with N = 200, 400, 600, and 1000 observations, respectively, mimicking common sample sizes in psychological research using factor analytic methods (e.g., DiStefano & Hess, 2005;Jackson et al., 2009). All generated observed variables X i with 1 ≤ i ≤ p had a kurtosis of either k = 3, 10, or 17 representing values that were observed in substantive research (e.g., Blanca et al., 2013;Cain et al., 2017). We specified the distribution of the indicator variables by either manipulating the multivariate distribution-based on non-normal errors or non-normal latent factors-or by directly drawing samples from marginal distributions with the respective kurtosis. In conditions with k = 10 and k = 17, respectively, the VM approach was used to induce non-normality in the marginal distributions. Moreover, we generated a multivariate normal control condition based on eigendecomposition (marginal condition with k = 3). Note that a marginal kurtosis of three and a skewness of zero can arise when data are multivariate normal, which we realized in the marginal condition for k = 3. Nevertheless, it is also possible to obtain multivariate non-normal data exhibiting the same values regarding skewness and kurtosis as a multivariate normal distribution (i.e., skewness of zero and kurtosis of three) but differing in higher order moments, which we realized in conditions with k = 3 under latent and error non-normality. This setup thus allows for the comparison between both multivariate normal and multivariate non-normal data sharing their skewness and kurtosis values.
Non-normality based on the multivariate distribution was created relying on the NOTAMO framework (Auerswald, 2017). Within this framework, the indicator variables X i are defined as the sum of two random variables L i and E i . All L i are correlated amongst each other, whereas all E i are independent from all other E i as well as from all L i . Depending on the particular non-normality condition, the distributions of L i and E i vary: In conditions with nonnormal latent variables, all L i follow a non-normal distribution and all E i are standard normally distributed. In conditions with non-normal errors, all L i are standard normally distributed, but all E i follow a non-normal distribution. Non-normal L i and E i , respectively, were generated with the NORTA algorithm (Cario & Nelson, 1997) requiring an inverse cumulative distribution function F −1 as input. We used NOTAMO to identify a suitable inverse cumulative distribution function for each random variable X i that complied with the prespecified central moments. NOTAMO defines the target inverse cumulative distribution function F −1 as a weighted sum described by the following quantile mixture distribution: with β 1 , …, β l as positive weights and ∑ l m=1 m = 1. Depending on the experimental condition, we varied the input inverse cumulative distribution functions F −1 m across conditions. In conditions with k = 3 and non-normal latent factors, we used a t-distribution with 4.1 degrees of freedom and a uniform distribution on the interval [0, 1]. For k = 3 and non-normal errors, a cubic standard normal distribution, a uniform distribution on the interval [0, 1], and a standard normal distribution were used to define F −1 . In conditions with k = 10 and non-normal errors as well as non-normal latent factors, we used a log-normal distribution with a mean of zero and standard deviation of one, as well as an exponential distribution with a rate of one, as input functions. The same two inverse cumulative distribution functions were used in conditions with k = 17 and non-normal latent factors. In conditions with non-normality in errors and k = 17, the input functions for the quantile mixture were a standard normal distribution and a mixture distribution based on a log-normal distribution and a negative log-normal distribution.

Study outcomes
We used the percentage of empirical p-values equal or smaller than the nominal significance level of .05 (i.e., the empirical rejection rates) of the LRT statistic as an indicator of type I error when estimating correctly specified models and as an indicator of the actually achieved power when estimating misspecified models. Moreover, we evaluated the performance of the RMSEA and the CFI, because both fit indices directly depend on the LRT statistic. For both fit indices, we used the median across the 1000 replications as the respective point estimate. The RMSEA in the population is defined as with F 0 indicating the population discrepancy and thus setting the minimum of the fit function in relation to the degrees of freedom of the model (Steiger, 2016;Steiger & Lind, 1980). The CFI (Bentler, 1990) expresses the proportional reduction in misfit by comparing the minimum of the fit function based on a null model (nm)-where all covariances equal zero-against the minimized fit function of the hypothesized model leading to We obtained sample estimates for the uncorrected test statistic T ML based on and The RMSEA and CFI of T mix were also computed based on Eqs. 14 and 15, respectively, but df was replaced by tr Û̂ . Table 2 provides the sample formulas of both fit indices considering the moment-based corrections to T ML . The uncorrected as well as the corrected sample estimates approximate the population values as in Eqs. 12 and 13, respectively (for details, see Brosseau-Liard et al., 2012;Brosseau-Liard & Savalei, 2014;Savalei, 2018). Within any one condition, the underlying correction approach was also applied to the null model.

Results
To maintain scope and increase clarity, we only present an illustrative subset of the results. The complete data and further results are available as supplementary materials in the open science framework (OSF) repository at https:// osf. io/ fxnsu/.

Effect on empirical rejection rates and empirical power
We relied on the liberal robustness criterion suggested by Bradley (1978) deeming rejection rates within the interval [α ± 0.5α] acceptable (i.e., [2.5%, 7.5%] based on a significance level α of .05). The underlying multivariate distribution revealed no relevant effect in conditions with k = 3. The rejection rates of most test statistics were close to the nominal level of 5.0% (see Fig. 1). Exceptions pertained to T ML and T MS in small samples, where the rejection rates were above the robustness criterion for T ML and below the robustness criterion for T MS .
With an increasing extent of non-normality (i.e., k > 3), T ML showed increasing empirical rejection rates regardless of sample size-in particular in conditions with latent nonnormality-with empirical rejection rates up to 36.8%. By contrast, the rejection rates were much lower under error non-normality and were above the robustness criterion only when k = 17, but decreased with increasing sample size. T MB yielded adequate rejection rates and exhibited only a slight tendency to overreject a fitting model when k = 17 in the case of latent non-normality for N = 1000 and under marginal non-normality for N = 200. There were only minor differences between the remaining test statistics, which tended to underreject models in all conditions when k > 3. Whereas T MS indicated a too optimistic fit across conditions, all remaining corrected test statistics showed acceptable rejection rates with increasing sample size if only a medium extent of non-normality was present. However, they exhibited rejection rates below the robustness criterion in conditions with k = 17, especially under error and marginal non-normality.
Based on the misspecification in the population, the expected power to reject the models was at least 0.986 (see Table 1). As summarized in Table 3, T ML and T MB closely recovered the expected power across all conditions. When N = 200, T MS was associated with very low power regardless of the extent of kurtosis. The remaining test statistics yielded a power close to the expected values in small samples when k = 3, but too few rejections occurred with an increasing extent of non-normality, in particular under latent and marginal nonnormality. As the sample size increased, the empirical power was generally adequate. Thus, the results concerning power generally mirror the results concerning the empirical rejection rates of correctly specified models by suggesting that all approaches other than T ML and T MB show a tendency to retain an incorrect model as non-normality increases.

Effect on RMSEA
To summarize (see supplement for details), no effect of the multivariate distribution occurred for k = 3, whereas in conditions with k > 3 the point estimates of all test statistics were larger under marginal and latent non-normality compared to error non-normality. In general, the approximation of the population RMSEA 0 improved in larger samples across test statistics and sources of non-normality. In misspecified models (Fig. 2), the point estimates of RMSEA ML , RMSEA MB , RMSEA MV2 , and RMSEA mix closely recovered RMSEA 0 with a maximum difference between 0.008 and 0.011 depending on the involved test statistic. In contrast, the maximum difference was 0.06 for RMSEA MV1 and 0.147 for RMSEA MS . Again, no effect of the source of non-normality was evident in conditions with k = 3. However, RMSEA MV1 and RMSEA MS distinctly differed from the population value and although this difference diminished with increasing sample size, the point estimates still exhibited a positive bias even in the largest sample size condition. When k = 10, the effect of the source of nonnormality was rather small, yet a more pronounced pattern could be observed in conditions with k = 17: Whereas RMSEA ML was virtually unaffected by the source of nonnormality, the remaining test statistics yielded larger values in error non-normality conditions compared to both other sources of non-normality.

Effects on CFI
Similar to empirical rejection rates and the RMSEA, no effect of the source of non-normality on the CFI in correctly specified models occurred in conditions with k = 3 (see supplement for details). In conditions with larger kurtosis, all test statistics exhibited smaller point estimates under latent and marginal non-normality than under error non-normality, especially in small samples. In misspecified models (see Fig. 3), the maximum difference between the CFI point estimate and CFI 0 was −0.017 for CFI ML , −0.006 for CFI MB , 0.012 for CFI MV2 , and −0.014 for CFI MV1 , CFI MS , and CFI mix , respectively. The effect of the source of non-normality became visible in small samples when k > 3: CFI ML , CFI MV1 , CFI MS , CFI mix provided a closer approximation of CFI 0 under error non-normality than under latent non-normality.
However, the observed bias diminished with increasing sample size. In contrast, CFI MB was virtually unaffected by the source of non-normality across conditions. A different pattern occurred for CFI MV2 , where larger values were observed under marginal and latent non-normality than under error non-normality. As the point estimates increased with increasing sample size, this led in turn to a close approximation of CFI 0 under marginal and latent non-normality in small samples and to virtually unbiased point estimates under error non-normality in larger samples.

Discussion
Non-normal data regularly occur in substantive research, so yielding valid test statistics and descriptive indices of model fit under such conditions is of particular importance. Whereas a number of corrections to the LRT statistic (and hence, derived fit indices) has been proposed, previous robustness studies usually created non-normality by manipulating the marginal distributions only and thus did not consider the source of non-normality. The present study provides evidence that the uncorrected test statistic, four corrected test statistics, and one test statistic based on both the weighted chi-square mixture distribution and the derived fit indices are affected by the source of non-normality in finite samples, even when the manifest variables exhibited the same levels of kurtosis. Note that the manipulation of other standardized moments such as skewness would also induce non-normality in manifest variables. However, studies indicate that psychological variables exhibit a larger range of kurtosis values compared to skewness values. Additionally, these variables show a wider range of kurtosis values regarding leptokurtic distributions compared to platykurtic distributions (Blanca et al., 2013;Cain et al. 2017). Hence, we decided to investigate non-normality conditions based on leptokurtic data allowing for the generation of data sets distinctly differing in the extent of non-normality but still representing values that can be observed by substantive researchers (see e.g., Curran et al., 1996). In line with previous robustness studies (e.g., Curran et al., 1996;Nevitt & Hancock, 2004), the uncorrected ML test statistic was associated with inflated type I error rates in the case of non-normally distributed data. However, when considering the source of non-normality, we showed that non-normal errors do not lead to increased rejection rates, which is consistent with the findings of Auerswald and Moshagen (2015). Thus, the uncorrected ML test statistic appears to be robust in finite samples when non-normality arises from non-normal errors but not when non-normality arises from non-normal latent variables. All corrected test statistics were also affected by the source of non-normality, albeit to a smaller extent as compared to the uncorrected T ML . In particular, the Satorra-Bentler scaled test statistic with Bartlett correction (T MB ) performed well across conditions by closely recovering the nominal significance level in correct models and closely approximating the expected power to reject incorrect models. By contrast, all remaining corrections under scrutiny showed rejection rates below the nominal significance level and a lower statistical power than expected. Marginal non-normality showed both of these effects, whereas the former was especially apparent in conditions of error nonnormality and the latter occurred primarily under latent non-normality.
Correcting T ML by the first standardized moment (i.e., mean) greatly improved its performance, whereas correcting further moments generally led to a tendency to retain models. This is unexpected as-from a theoretical point of view-corrections of higher-order moments should result in further improvements. A similar pattern of results was also evident for the approach to draw inferences from the estimated limiting mixture distribution. As this approach does not correct for particular standardized moments but directly estimates the underlying weighted mixture distribution, we expected a superior performance. The results, however, show that corrections of higher order moments, and especially the estimated weighted chi-square mixture distribution-where all moments should be correct, as the underlying distribution is directly estimated-generally were associated with an underestimation tendency, thus leading to an inadequate type I error control and a lack of statistical power. A possible explanation for the comparatively poor performance of all approaches attempting to correct for additional moments beyond the mean might lie in unreliabilities regarding the estimation of the weights via the eigenvalues of Û̂ . The estimation errors of these weights have more severe consequences in non-linear corrections (such as the corrections for higher order moments) than in linear corrections (such as the correction for the first standardized moment), so the correction applied in T MB might be less affected by incorrectly estimated weights, in turn leading to the observed superior performance.
Beyond the LRT statistic itself, we also investigated derived descriptive fit indices computed from the respective uncorrected or corrected test statistics. Concerning the RMSEA in correctly specified models, the source of non-normality had an effect on all versions of RMSEA but its magnitude varied across the analyzed test statistics. In misspecified models, all RMSEA based on corrected test statistics were affected by the source of nonnormality by yielding larger values in error non-normality conditions than both other sources of non-normality; however, the observed bias was generally small to moderate. Exceptions pertain to RMSEA MV1 and RMSEA MS, which strongly overestimated the population RMSEA 0 leading to a too negative fit evaluation.
Concerning the CFI in correctly specified models, smaller point estimates occurred under marginal and latent non-normality compared to error non-normality, especially in small samples, regardless of the underlying test statistic. An effect of the source of non-normality also became evident in misspecified models, where CFI ML , CFI MV1 , CFI MS , and CFI mix showed a stronger bias under latent compared to error non-normality. Nevertheless, with increasing sample size the bias diminished for all test statistics except for CFI MV2 , whose bias depended on the sample size and the source of non-normality.

Conclusion
To assess model fit in substantive research, it is recommended to not rely on a single criterion but to consider various measures of fit (for an overview, see West et al., 2012). Whereas we showed that T ML is virtually unbiased when non-normality arises from non-normal errors, the source of non-normality is unknown in practice. In case of non-normal data, we thus recommend relying on the Satorra-Bentler scaled (i.e., mean-corrected) test statistic with Bartlett correction (T MB ), which performed satisfactorily throughout conditions regardless of the particular source of non-normality. Generally, all remaining corrections considered herein (T MV1 , T MV2 , T MS, T mix ) revealed systematic biases in at least some conditions, in particular when latent variables were non-normal. Thus, we recommend against their use. This general recommendation also extends when considering RMSEA or CFI as descriptive indices of fit, because indices based on T MV1 , T MV2 , T MS did not perform well, whereas RMSEA and CFI based on T MB performed satisfactorily overall. The results also indicate that a better approximation can be obtained when using the degrees of freedom as obtained by tr Û̂ , as we have done for the T mix approach. In general, we encourage researchers to consider distributional information such as the expected value and use unbiased sample estimates of descriptive fit indices.
To summarize, we demonstrated that the source of nonnormality has an effect not only on the uncorrected but also on corrected test statistics, which is especially relevant as these corrections are used to deal with non-normal data. No general pattern could be identified because the particular effects on measures of fit depend on variables like the applied test statistic or the specification status of the model. However, the present work shows that some test statistics are rather robust regarding the source of non-normality, whereas others are strongly affected by non-normal latent factors but are not necessarily affected by non-normal errors. Although the six investigated test statistics showed varying patterns across the analyzed conditions, T MB seems suitable to correct for non-normality regardless of the extent or source of non-normality and thus appears to be a reasonable choice to evaluate model fit in the presence of non-normal data. Concerning RMSEA and CFI as descriptive indices of fit, we suggest relying on robust versions based on T MB approximating the same population value as versions of these indices based on the uncorrected ML LRT statistic.
Funding Open Access funding enabled and organized by Projekt DEAL.

Declarations
Funding Open Access funding enabled and organized by Projekt DEAL. No funding was received for conducting this study.

Conflicts of interest
The authors have no relevant financial or nonfinancial interests to disclose. We have no known conflict of interest or sources of financial support to disclose.

Ethics approval Not applicable.
Consent to participate Not applicable.

Consent to publish Not applicable.
Open Practices Statement -Availability of data and materials Data and supplemental materials of the current study are available in the Open Science Framework (OSF) at https:// osf. io/ fxnsu/. The simulation study was not preregistered.
Open Practices Statement -Code availability A script illustrating how the analyzed test statistics can be obtained is available in the Open Science Framework (OSF) at https:// osf. io/ fxnsu/. The code used to generate and analyze the data of the current study is available from the corresponding author on reasonable request.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.