Investigating the performance of level-specific fit indices in multilevel confirmatory factor analysis with dichotomous indicators: A Monte Carlo study

Lin, John J. H.; Hsu, Hsien-Yuan

doi:10.3758/s13428-022-02014-z

Published: 23 November 2022

Volume 55, pages 4222–4259, (2023)

Behavior Research Methods

Abstract

We conducted a Monte Carlo study to examine the performance of level-specific χ² test statistics and fit indices regarding their capacity to determine model fit at specific levels in multilevel confirmatory factor analysis with dichotomous indicators. Five design factors were considered in this study: numbers of groups (NG), group size (GS), intra-class correlation (ICC), thresholds of dichotomous indicators (THR), and factor loadings (FL). According to our simulation results, practitioners should be aware that the performance of between-level-specific (b-l-s) χ² and fit indices was mainly influenced by ICC and FL, followed by NG. THR could also slightly affect the performance of b-l-s fit indices in some conditions. Both b-l-s χ² and fit indices indicated model fit more accurately as ICC or FL increased. A small to medium NG (50–100) might be sufficient for b-l-s χ² and fit indices only if both ICC and FL were high, while in the remaining conditions, an NG of 200 was needed. Moreover, practitioners could use within-level-specific (w-l-s) χ² and fit indices (except for RMSEA_W) along with traditional cut-off values to evaluate within-level models comprising dichotomous indicators. W-l-s χ² and fit indices determined model fit more accurately as FL increased. THR had a slight impact on the performance of ${\chi}_W^2$, RMSEA_W, CFI_W, and TLI_W. Unfortunately, RMSEA_W was heavily affected by FL and THR and could determine model fit only when FL was high and THR was symmetric.

Introduction

Multilevel confirmatory factor analysis (MCFA), or multilevel measurement modeling, is an important statistical approach for validating latent constructs that underlie multiple item responses collected in multilevel settings (e.g., students nested within schools). MCFA has been widely used in psychological and educational research, and relevant reporting guidelines have been made available for substantive researchers (Kim et al., 2016). Unlike in conventional confirmatory factor analysis, the construct of interest in MCFA can exist at multiple levels, which creates challenges for substantive researchers in model specification, evaluation, and interpretation. Due to this complexity, extensive efforts have been devoted to providing instructions for appropriate specification and interpretation of MCFA (Stapleton, McNeish et al., 2016a; Stapleton, Yang et al., 2016b).

Regarding model evaluation, a body of research has shown that traditional fit indices [e.g., root mean square error of approximation (RMSEA), comparative fit index (CFI), Tucker–Lewis Index (TLI), and standardized root mean square residual (SRMR)] are not sensitive to misspecified between-level models in multilevel structural equation modeling (MSEM) (Hsu et al., 2017; Padgett & Morgan, 2020). For this reason, researchers have advocated the use of level-specific (l-s) fit indices for evaluating the within-level model and the between-level model separately (Hox, 2010; Hsu et al., 2017; Rappaport et al., 2020; Ryu, 2014; Ryu & West, 2009; Schermelleh-Engel et al., 2014; Wu et al., 2017). According to Ryu and West (2009), the l-s fit indices could be straightforwardly computed by the partially saturated-model method (PS method). Note that the existing guidelines for using level-specific fit indices are based on population multilevel measurement models with continuous indicators, and few address ordered categorical variables (Padgett & Morgan, 2020). Yet it is unclear whether these guidelines could be applicable to models with dichotomous indicators.

To shed light on this issue, this study examines the performance of level-specific χ² test statistics and fit indices derived from the PS method in terms of their sensitivity to lack of fit at specific levels in MCFA with dichotomous indicators. In addition, the effectiveness of two alternative level-specific fit indices obtained from Mplus, SRMR for the within-level model (SRMR_W) and SRMR for the between-level model (SRMR_B), was compared with that of the PS-level-specific fit indices.

MCFA with dichotomous indicators in multilevel structural equation modeling

For simplicity, we consider a two-level single-factor model. Let Y_pig denote the p^th dichotomous indicator (i.e., latent variable indicator) for individual i nested within group g (p = 1…P dichotomous indicators, i = 1…N individuals, and g = 1…G groups),

$${Y}_{pig}=\begin{cases}1, & \text{if } {y}_{pig}^{\ast }>\tau \\ 0, & \text{if } {y}_{pig}^{\ast}\le \tau \end{cases}$$

(1)

The equation expresses a threshold model, which assumes that underlying the dichotomous indicator Y_pig is a normally distributed continuous latent variable ${y}_{pig}^{\ast }$ whose position relative to the threshold (τ) determines the category of the dichotomous indicator (Asparouhov & Muthen, 2007; Bollen, 2002). That is, the indicators of interest are conceptualized as continuous, but the response format of each indicator is a restrictive, dichotomous scale (Bollen, 2002). For example, if the i^th individual falls short of or exactly at the threshold, the response of this individual is 0. If the i^th individual exceeds the threshold, the response of this individual is 1.
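The threshold rule in Eq. (1) can be sketched in a few lines of Python. This is only an illustration under the assumption that the latent response is standard normal; the helper names `dichotomize` and `threshold_for_share_of_zeros` are hypothetical and not part of the study's materials.

```python
import random
from statistics import NormalDist


def dichotomize(y_star, tau):
    """Apply Eq. (1): 1 if the latent response exceeds tau, otherwise 0."""
    return 1 if y_star > tau else 0


def threshold_for_share_of_zeros(share_zeros):
    """Threshold on a standard normal y* that yields the given share of 0 responses.

    Assumption: y* ~ N(0, 1); a different latent variance would rescale tau.
    """
    return NormalDist().inv_cdf(share_zeros)


rng = random.Random(2022)
latent = [rng.gauss(0.0, 1.0) for _ in range(100_000)]

tau_symmetric = threshold_for_share_of_zeros(0.50)   # threshold at the mean
tau_asymmetric = threshold_for_share_of_zeros(0.80)  # shifts mass toward 0 responses

share_ones_symmetric = sum(dichotomize(v, tau_symmetric) for v in latent) / len(latent)
share_ones_asymmetric = sum(dichotomize(v, tau_asymmetric) for v in latent) / len(latent)
```

Moving the threshold away from the mean of the latent response makes the two response categories unequally likely, which is what an asymmetric threshold means in this context.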

Using similar notations to those used by Padgett and Morgan (2021), in this section, we outline a two-level measurement model with dichotomous indicators, as illustrated in Fig. 1. Note that the model in Fig. 1 was also adopted as a population model for simulated dataset generation in the current study. Using the between-and-within specification approach (Hox, 2010; Muthen, 1994), the covariance structure is partitioned into a within-level component (denoted by a W subscript) and a between-level component (denoted by a B subscript). Separate models are specified for each component. The within-level component captures individual-level variation, while the between-level component captures variation between groups.

Let y_ig denote the p-dimensional response vector for student i in group g. The response vector y_ig is decomposed as:

$${y}_{ig}=\mu +{y}_{w_{ig}}+{y}_{B_g}$$

(2)

where μ represents the grand mean, and ${y}_{w_{ig}}$ and ${y}_{B_g}$ are independent within-level and between-level components, respectively. The measurement model at the within-level is given by the equation:

$${y}_{ig}={\mu}_g+{\varLambda}_W{\eta}_{W_{ig}}+{\epsilon}_{W_{ig}}$$

(3)

where Λ_W is the p × 2 factor loadings matrix for the within-level latent factors (${\eta}_{W_{ig}}$). The vector ${\eta}_{W_{ig}}$ is multivariate normally distributed with an expectation of zero and a 2 × 2 covariance matrix Ψ_W. The vector ${\epsilon}_{W_{ig}}$ is multivariate normally distributed with an expectation of zero and a p × p diagonal covariance matrix Θ_W, with the residual variances along the diagonal.

The measurement model at the between-level is given by

$${\mu}_g=\mu +{\varLambda}_B{\eta}_{B_g}+{\epsilon}_{B_g}$$

(4)

Here, Λ_B, ${\eta}_{B_g}$, and ${\epsilon}_{B_g}$ are the between-level terms corresponding to the within-level terms Λ_W, ${\eta}_{W_{ig}}$, and ${\epsilon}_{W_{ig}}$. Moreover, the covariance matrices Ψ_B and Θ_B are the between-level counterparts to the within-level covariance matrices Ψ_W and Θ_W. We can obtain Eq. (5) after combining Eqs. (3) and (4):

$${y}_{ig}=\mu +{\varLambda}_W{\eta}_{W_{ig}}+{\varLambda}_B{\eta}_{B_g}+{\epsilon}_{W_{ig}}+{\epsilon}_{B_g}$$

(5)
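As a hedged illustration of what Eq. (5) implies, the covariance of the latent responses can be written as the sum of a within-level part and a between-level part. The sketch assumes, beyond what is stated above, that factors and residuals are mutually uncorrelated within and across levels; Σ_W, Σ_B, and Σ_T are labels introduced here for illustration only.

$${\varSigma}_T={\varSigma}_W+{\varSigma}_B,\kern1em {\varSigma}_W={\varLambda}_W{\varPsi}_W{\varLambda}_W^{\prime }+{\varTheta}_W,\kern1em {\varSigma}_B={\varLambda}_B{\varPsi}_B{\varLambda}_B^{\prime }+{\varTheta}_B$$

Under these assumptions, each level has its own model-implied covariance matrix, which is why a model can be specified and evaluated separately at each level.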

Estimator

The diagonally weighted least squares (DWLS) estimator is recommended for single-level confirmatory factor analysis (CFA) with categorical indicators (DiStefano & Morgan, 2014; Forero et al., 2009) and for multilevel CFA due to its ability to identify the correct model specification (Padgett & Morgan, 2020). The DWLS estimator is based on polychoric correlations and uses the inverse of the asymptotic covariance matrix (W^-1) of the sample variances and covariances as a weight matrix. Because the estimation of W^-1 is quite unstable when the sample size is small, DWLS uses only the diagonal elements of W in model fitting and uses the full W to obtain standard errors and χ² values (Jöreskog & Sörbom, 1996). As a result, DWLS produces robust standard errors and χ² values (Flora & Curran, 2004; Muthen, 1993). Finney et al. (2006) suggested that when fewer than five categories are used, the DWLS estimator yields robust parameter estimates, standard errors, and fit indices for models with categorical data. In addition, the DWLS estimator has been found to perform well with small sample sizes and large models (Flora & Curran, 2004; Yang-Wallentin et al., 2010). Beauducel and Herzberg (2006) found that DWLS produced fit indices (root mean square error of approximation [RMSEA], comparative fit index [CFI], Tucker-Lewis Index [TLI]) that adequately indicated correctly specified models. However, when data were non-normally distributed in two to four categories, Bandalos (2008) found that robust DWLS-based RMSEA and CFI did not adequately identify misspecified models. Note that the DWLS estimator is known as the weighted least squares mean and variance adjusted (WLSMV) estimator in Mplus. In the present study, the DWLS estimator was adopted for analyzing simulated datasets by using the command “estimator = WLSMV” in Mplus.
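A minimal sketch of the DWLS discrepancy function may help fix ideas. The symbols s (the vector of sample statistics, such as thresholds and polychoric correlations) and σ(θ) (its model-implied counterpart) are notation assumed here for illustration and are not taken from the text above.

$${F}_{DWLS}\left(\theta \right)={\left(s-\sigma \left(\theta \right)\right)}^{\prime }{\left[\operatorname{diag}(W)\right]}^{-1}\left(s-\sigma \left(\theta \right)\right)$$

Only the diagonal of W is inverted during estimation, which avoids inverting the full matrix in small samples. The full W is then used to adjust the standard errors and χ² values.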

Cut-off values for using fit indices on CFA with dichotomous indicators

Several simulation studies have examined whether Hu and Bentler’s (1999) conventional cut-off values for traditional fit indices (i.e., RMSEA < .06; CFI and TLI > .95; SRMR < .08) can be applied similarly when DWLS is applied to ordered categorical data. In general, prior studies suggested that conventional cut-off values should be applied with careful consideration of the data’s characteristics, such as sample size, asymmetry of categorical data, and number of categories (DiStefano & Morgan, 2014; Garrido, Abad, & Ponsoda, 2016; Nye & Drasgow, 2011; Xia & Yang, 2019). To the best of our knowledge, Padgett and Morgan (2021) is the only study that provided recommended cut-off criteria for using traditional fit indices in MCFA with categorical indicators. Specifically, Padgett and Morgan found that CFI, TLI, and RMSEA were primarily influenced by within-level misspecification, but only partially influenced by between-level misspecification. In addition, the performance of the three fit indices was affected by the sample size at the between-level (N₂) when DWLS was used. Although they provided recommended cut-off criteria for CFI and TLI (> .98 if N₂ < 100; > .97 if N₂ ≥ 100) and RMSEA (< .02 regardless of N₂), they also cautioned that those fit indices should be used only to provide weak evidence for some type of misspecification. Alternatively, SRMR_W and SRMR_B may provide information on within-level and between-level model-data fit, respectively. When DWLS is used, SRMR_W needs a cut-off value of < .05 if N₂ < 100, and a cut-off value of < .04 if N₂ ≥ 100. On the other hand, SRMR_B is suggested to have a cut-off value of < .06 if N₂ ≥ 100 and should not be used if N₂ < 100.
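The cut-off criteria attributed above to Padgett and Morgan can be collected in a small lookup, sketched here in Python. The function names and the dictionary layout are hypothetical conveniences; only the numeric bounds come from the paragraph above.

```python
def recommended_cutoffs(n_between):
    """Cut-offs summarised above for DWLS-based MCFA, keyed by fit index.

    Each value is a (direction, bound) pair; None means the index is not
    recommended at this between-level sample size (N2).
    """
    small = n_between < 100
    return {
        "CFI": (">", 0.98 if small else 0.97),
        "TLI": (">", 0.98 if small else 0.97),
        "RMSEA": ("<", 0.02),
        "SRMR_W": ("<", 0.05 if small else 0.04),
        "SRMR_B": None if small else ("<", 0.06),
    }


def meets_cutoff(index_name, value, n_between):
    """Return True/False for the cut-off, or None when the index should not be used."""
    rule = recommended_cutoffs(n_between)[index_name]
    if rule is None:
        return None
    direction, bound = rule
    return value > bound if direction == ">" else value < bound
```

For instance, `meets_cutoff("SRMR_B", 0.05, 50)` returns `None`, reflecting the advice not to use SRMR_B when N₂ < 100.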

Model evaluation: PS-level-specific fit indices

Numerous simulation studies have recommended level-specific tests of exact fit (i.e., χ² test statistics) and fit indices derived from the partially saturated-model method for detecting misspecification in MCFA (Hsu et al., 2017; Lee & Sohn, 2022; Rappaport et al., 2020; Ryu, 2011, 2014; Ryu & West, 2009; Schermelleh-Engel et al., 2014). Using the PS method, researchers can first derive the between-level-specific (b-l-s) χ² test statistic (${\chi}_{PS\_B}^2$). Specifically, ${\chi}_{PS\_B}^2$ can be derived by specifying a hypothesized between-level model and saturating the within-level model (i.e., correlating all observed variables). A saturated within-level model can be seen as a just-identified model with zero degrees of freedom, and thus contributes a χ² test statistic equal to zero. Consequently, ${\chi}_{PS\_B}^2$ only reflects the model fit of the hypothesized between-level model. Since fit indices are a function of χ² test statistics, researchers can then compute b-l-s fit indices (RMSEA_{PS_B}, CFI_{PS_B}, TLI_{PS_B}) using the value of ${\chi}_{PS\_B}^2$.

In the same manner, within-level-specific (w-l-s) χ² test statistics (${\chi}_{PS\_W}^2$) can be derived by specifying a hypothesized within-level model and saturating the between-level model. The w-l-s fit indices (RMSEA_{PS_W}, CFI_{PS_W}, TLI_{PS_W}) can be computed using the value of ${\chi}_{PS\_W}^2$. The formulas for computing l-s fit indices are identical to those in Ryu and West (2009) and Hsu, Lin, Skidmore, and Kim (2018) studies (see Appendix A). Additionally, the performance of the aforementioned l-s fit indices was compared with that of two alternative l-s fit indices, SRMR_W and SRMR_B, which are computed based on the discrepancy between the sample covariance and the corresponding model-implied covariance. Both SRMR_W and SRMR_B are available in the Mplus model solution output.
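To make concrete how fit indices follow from a χ² value, the sketch below implements the widely used textbook forms of RMSEA, CFI, and TLI in Python. It is an assumption that these match the level-specific formulas in Appendix A; in particular, the sample-size term `n` (for example, whether it is the number of groups or the number of individuals, and whether 1 is subtracted) and the baseline model used for CFI and TLI should be taken from the appendix. All numeric inputs are hypothetical.

```python
import math


def rmsea(chi2, df, n):
    """Root mean square error of approximation from a chi-square value.

    Assumption: denominator df * n; some definitions use df * (n - 1).
    """
    return math.sqrt(max(chi2 - df, 0.0) / (df * n))


def cfi(chi2, df, chi2_base, df_base):
    """Comparative fit index relative to a baseline (null) model."""
    numerator = max(chi2 - df, 0.0)
    denominator = max(chi2 - df, chi2_base - df_base, 0.0)
    return 1.0 if denominator == 0 else 1.0 - numerator / denominator


def tli(chi2, df, chi2_base, df_base):
    """Tucker-Lewis index relative to a baseline (null) model."""
    base_ratio = chi2_base / df_base
    return (base_ratio - chi2 / df) / (base_ratio - 1.0)


# Hypothetical partially saturated-model results for one level.
example = {
    "RMSEA": rmsea(50.0, 34, 200),
    "CFI": cfi(50.0, 34, 900.0, 45),
    "TLI": tli(50.0, 34, 900.0, 45),
}
```

Because all three indices are functions of the same χ² value, anything that distorts the level-specific χ² will carry over to the corresponding fit indices.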

Previous simulation studies on l-s fit indices

Prior simulation studies have investigated the performance of l-s fit indices when the multivariate normality assumption was met in MCFA. Ryu and West (2009) examined the effectiveness of b-l-s fit indices (RMSEA_{PS_B}, CFI_{PS_B}) and w-l-s fit indices (RMSEA_{PS_W}, CFI_{PS_W}) using a population MCFA model with continuous indicators where multivariate normality was assumed. The intraclass correlation coefficient (ICC) level in the population model was fixed to 0.5. Ryu and West (2009) found that both types of fit indices can correctly indicate good and poor model fit in various sample size conditions (numbers of groups = 50, 100, 200, and 1000, and group size = 20, 50, 100). Ryu and West (2009), however, also found that a sample size of 50 groups could cause nonconvergence problems. Ryu and West’s findings were validated by Rappaport et al. (2020).

Hsu et al. (2017) extended the findings of Ryu and West (2009) by investigating the performance of l-s fit indices when ICC was smaller than 0.5. Hsu et al. (2017) found that the performance of w-l-s fit indices (RMSEA_{PS_W}, CFI_{PS_W}, TLI_{PS_W}, and SRMR_w) was barely influenced by ICC, while b-l-s fit indices (RMSEA_{PS_B}, CFI_{PS_B}, TLI_{PS_B}, and SRMR_B) were less-promising indicators to correctly indicate good or poor model fit. Similarly, Lee and Sohn (2022) found that w-l-s fit indices were sensitive to detecting misspecified within-level models and were less impacted by ICC and sample size. In addition, Lee and Sohn discovered that RMSEA_{PS_B} and SRMR_B were more promising for detecting misspecified between-level models with an increase in ICC, while CFI_{PS_B} and TLI_{PS_B} were also influenced by ICC, but the influence was moderated by the type of misspecifications in models (e.g., misspecification in factor cross-loadings or factor covariance). In summary, previous studies have considered various design factors, such as numbers of groups, group size, ICC, and type of misspecifications.

To the best of our knowledge, the performance of l-s fit indices in MCFA with categorical indicators has not been well examined in prior research. Hsu (2009) examined the sensitivity of SRMR_W and SRMR_B in MCFA with dichotomous indicators. Hsu (2009) found that SRMR_W can correctly indicate good model fit (i.e., type I error rate < .05) and poor model fit (i.e., statistical power > .80) due to an intentional misspecification in factor covariance. The type I error rate of SRMR_B, however, tended to be higher across different conditions, and the statistical power was less satisfying when ICC was low. Similar findings about SRMR_W and SRMR_B were revealed in Navruz’s (2016) study.

The present study

To date, no studies have attempted to evaluate the performance of l-s fit indices derived from the PS method in MCFA with categorical indicators. To address this concern, the present study attempted to evaluate the performance of commonly used l-s fit indices using a population model with dichotomous indicators. The findings of this study may shed some light on the practice of model evaluation when categorical indicators are used in MCFA.

The design factors considered in the present study included numbers of groups, group size, ICC, as well as two other factors, thresholds of dichotomous indicators and factor loadings, which have not been widely investigated in previous simulation studies focusing on effectiveness of l-s fit indices. Asymmetric thresholds of dichotomous indicators could occur in real-world situations, and prior studies have discovered the impact of asymmetric thresholds on the performance of several scaled or adjusted χ² statistics (hereafter called adjusted χ² statistics). For example, Rhemtulla, Brosseau-Liard, and Savalei (2012) examined three different conditions of thresholds (50:50, 60:40, and 80:20) and found that adjusted χ² statistics had relatively low statistical power in detecting serious model misspecification with asymmetric thresholds and small samples. In addition, type I error rates were reasonable (smaller than .05) when the threshold was symmetric, but extreme asymmetry thresholds (80:20) could cause high type I error rates. Savalei and Rhemtulla (2013) conducted simulations based on a two-factor model with dichotomous indicators using a DWLS estimator. They examined the difference between symmetric thresholds (50:50) and asymmetric thresholds (64:36 and 85:15) on adjusted χ² statistics examined in Rhemtulla et al. (2012). Results suggested that the performance of adjusted χ² statistics decreases as thresholds become more asymmetric. To the best of our knowledge, the impact of asymmetric thresholds on l-s fit indices has not been well documented. Because fit indices were a function of χ² statistics, the performance of fit indices could very likely be influenced by the thresholds of dichotomous indicators. For this reason, we examined the impact of asymmetric thresholds on level-specific fit indices in MCFA with dichotomous indicators.

In addition, previous simulation studies (e.g., Forero et al., 2009; Garrido et al., 2016; Heene et al., 2011; Nestler, 2013) have consistently shown that CFAs with low factor loadings resulted in less adequate model evaluation results. Nestler (2013) examined the impact of factor loadings, set to high (0.70), medium (0.55), and low (0.40), on χ² statistics with a two-factor CFA with dichotomous indicators. Nestler found that when factor loadings were high, the statistical power of χ² test statistics was satisfactory (approximately > .80), regardless of sample size. However, when factor loadings were medium or low, a sample size of 250 or 500, respectively, was needed to retain satisfactory statistical power. Furthermore, previous simulation studies have shown that the magnitudes of factor loadings had an impact on the performance of traditional fit indices. For example, Heene et al. (2011) examined how the performance of the χ² test statistic, RMSEA, CFI, and SRMR could be influenced by three levels of factor loadings [low ~(0.30, 0.50); medium ~(0.50, 0.70); high ~(0.70, 0.90)]. Heene et al. (2011) found that decreasing factor loadings led to decreasing values of the χ² test statistic, RMSEA, CFI, and SRMR, altering the statistical power to detect misspecified models. As a result, misspecification was often not detected by the χ² test statistic, RMSEA, or SRMR when factor loadings were low. However, high factor loadings can cause the rejection of just slightly misspecified models (cf. Savalei, 2012). In contrast, Heene et al. (2011) found that the CFI tended to indicate poorer fit for models with low factor loadings. The reason is that models with lower factor loadings have lower covariances between the observed variables. As a result, the distance between the hypothesized model and the baseline null model is reduced, resulting in lower CFI values (Garrido et al., 2016). The impact of low factor loadings on CFI can be extended to TLI because both fit indices are a function of the distance between the hypothesized model and the baseline null model. To date, few efforts have been made to study the impact of factor loadings on l-s fit indices in MCFA with dichotomous indicators. Our study aimed to bridge this knowledge gap.

Note that although we were aware that type of misspecification (MT) can be manipulated as a design factor, our study only considered one MT condition, which was the misspecification in factor covariance. Our selected misspecification scenario is important for applied researchers who wish to verify the construct validity of their instruments. We did not include the scenario of misspecification in cross-loadings as another MT condition because we think misspecification in cross-loadings is a highly complex topic (e.g., over-specification, under-specification, magnitudes of factor cross-loadings) which deserves a new study to comprehensively investigate this issue.

Method

A Monte Carlo study was conducted to evaluate the performance of l-s fit indices in MCFA with dichotomous indicators. The five design factors examined in this study were numbers of groups, group size, intraclass correlation coefficient, thresholds of dichotomous indicators, and factor loadings.

Population model

In the current study, the population model (see Fig. 1) for simulation data generation was based on the population model presented in Hsu et al.’s (2017) study. The population model was a two-level measurement model with two within-level factors (η_W1 and η_W2) and two between-level factors (η_B1 and η_B2). At the within-level, five dichotomous observed indicators were loaded on each factor. Parameters in the within-level model for simulation data generation were factor loadings = 0.70, factor variances = 1.00, and factor covariance = 0.30. The residual variance parameters cannot be freely estimated; therefore, no initial residual variances were set (Muthen, 1990). Factors and residual variances were independent of each other. Note that the correlation between two within-level factors was also 0.30 because within-level factor variances were fixed at 1.00. The threshold of indicators was equal to 0, resulting in a 50:50 proportion of responses that are 0 or 1.

The between-level model had an identical factorial structure to the within-level model. Parameters for simulation data generation were factor loadings = 0.70 and residual variances = 0.51. Factors and residuals were independent of each other. We varied the variance of between-level factors (a in Fig. 1) to create three ICC conditions. (More detailed information is presented in the next section.) Note that for each ICC condition, we adjusted the between-level factor covariance (b in Fig. 1) based on the formula $0.30\times \sqrt{\operatorname{var}\left({\eta}_{B1}\right)}\times \sqrt{\operatorname{var}\left({\eta}_{B2}\right)}$ so that the correlation of the two between-level factors was held at 0.30 across different ICC conditions.
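The data-generating process described above can be approximated with the following Python sketch. It is not the Mplus setup used in the study. It assumes NumPy, an illustrative between-level factor variance of 0.5 for both factors, the same loading at both levels, and a within-level residual variance of 1 − loading² (the text sets no within-level residual variance, so this value is purely an assumption). The function name `simulate_two_level` is hypothetical.

```python
import numpy as np


def simulate_two_level(n_groups, group_size, loading=0.70, var_between=0.5,
                       tau=0.0, seed=0):
    """Generate dichotomous two-level data from a two-factor model at each level."""
    rng = np.random.default_rng(seed)
    p = 10                                   # five indicators per factor
    lam = np.zeros((p, 2))
    lam[:5, 0] = loading
    lam[5:, 1] = loading

    psi_w = np.array([[1.0, 0.30], [0.30, 1.0]])
    cov_b = 0.30 * np.sqrt(var_between) * np.sqrt(var_between)
    psi_b = np.array([[var_between, cov_b], [cov_b, var_between]])
    theta_w = 1.0 - loading ** 2             # assumption, not stated in the text
    theta_b = 0.51                           # between-level residual variance

    # Between-level components, one row per group.
    eta_b = rng.multivariate_normal(np.zeros(2), psi_b, size=n_groups)
    y_b = eta_b @ lam.T + rng.normal(0.0, np.sqrt(theta_b), size=(n_groups, p))

    # Within-level components, one row per individual.
    n = n_groups * group_size
    eta_w = rng.multivariate_normal(np.zeros(2), psi_w, size=n)
    y_w = eta_w @ lam.T + rng.normal(0.0, np.sqrt(theta_w), size=(n, p))

    group = np.repeat(np.arange(n_groups), group_size)
    y_star = y_w + y_b[group]                # latent responses with grand mean 0
    return group, (y_star > tau).astype(int)


group, y = simulate_two_level(n_groups=200, group_size=20, seed=1)
```

With `tau=0.0` and a grand mean of zero, each indicator is expected to split roughly evenly between 0 and 1.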

Design factors

Numbers of groups (NG)

An NG larger than 100 has been recommended for obtaining acceptable estimates at the between-level when ICC is low (Hox & Maas, 2001; Hsu et al., 2015). Ryu and West (2009) used NG = 50, 100, 200, and 1000 in their simulation work, where NG = 50 could lead to nonconvergence problems and NG = 1000 was not a realistic NG for practitioners. Hsu et al. (2017) found that NG = 200 seemed to compensate for convergence problems if ICC was low. As a result, this study considered NG ranging from 50 to 200. Specifically, the current study adopted three NG conditions (50, 100, and 200) to evaluate the impact of NG on the performance of level-specific fit indices when the indicators of MCFA were dichotomous.

Group size (GS)

Hox and Maas (2001) used GS = 10, 20, and 50 in an MSEM study and found that GS had a trivial impact on parameter estimates and standard errors. Additionally, Ryu and West (2009) used GS = 20, 50, and 100 in their study focusing on the performance of fit indices in MSEM. In consideration of common practices and comparability, the present study adopted GS = 10, 20, and 50.

Intraclass correlation coefficient (ICC)

The ICC (ρ) is defined as:

$$\rho =\frac{VAR_{Between}}{VAR_{Between}+{VAR}_{Within}}$$

(6)

where VAR_Between and VAR_Within are the variance of between-level factors and within-level factors, respectively. The variance of within-level factors was constrained to 1.00, while that of between-level factors varied to create different ICC conditions. The present study considered three levels of between-level variance (0.1, 0.3, and 0.5), resulting in three levels of ICC: .091 (ICC1), .231 (ICC2), and .333 (ICC3).
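The three ICC levels follow directly from Eq. (6), and the same between-level variances determine the between-level factor covariance needed to hold the factor correlation at 0.30. The Python sketch below reproduces this arithmetic. It assumes that both between-level factors share the same variance within a condition, and the function names are hypothetical.

```python
import math


def icc(var_between, var_within=1.0):
    """Intraclass correlation as defined in Eq. (6)."""
    return var_between / (var_between + var_within)


def between_covariance(var_b1, var_b2, correlation=0.30):
    """Covariance that keeps the between-level factor correlation at the target."""
    return correlation * math.sqrt(var_b1) * math.sqrt(var_b2)


between_variances = [0.1, 0.3, 0.5]
icc_levels = [round(icc(v), 3) for v in between_variances]
covariances = [round(between_covariance(v, v), 3) for v in between_variances]
```

Here `icc_levels` evaluates to `[0.091, 0.231, 0.333]`, matching ICC1 to ICC3, and `covariances` evaluates to `[0.03, 0.09, 0.15]` under the equal-variance assumption.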

Thresholds of dichotomous indicators (THR)

To the best of our knowledge, study of the impact of asymmetric thresholds on l-s fit indices is still lacking. Guided by Rhemtulla et al.’s (2012) and Savalei and Rhemtulla’s (2013) studies, we considered both symmetric (50:50) and asymmetric (80:20) conditions. By considering these design factors, we intended to understand the impact of asymmetry of categorical data on the performance of l-s χ² and fit indices.

Factor loadings (FL)

Following Nestler’s (2013) simulation study, we adopted three conditions of FL in this study: low (.40), medium (.55), and high (.70). The setting of FL was also in line with previous simulation studies (Forero et al., 2009; Garrido et al., 2016; Heene et al., 2011). Previous simulation studies have shown that low magnitudes of factor loadings resulted in less sensitivity of traditional fit indices. By including this design factor, we intended to examine the extent to which the magnitudes of factor loadings can affect the performance of l-s χ² and fit indices.

Crossing these design factors yielded a total of 162 conditions (NG = 50, 100, and 200; GS = 10, 20, and 50; ICC = ICC1 to ICC3; THR = 50:50 and 80:20; FL = .40, .55, and .70). For each condition, 500 replications were generated using Mplus 7.4 (Muthen & Muthen, 1998–2015).
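The fully crossed design can be enumerated in a few lines of Python, which also confirms the count of 162 conditions. The dictionary keys and value labels are illustrative choices, not identifiers from the study's scripts.

```python
from itertools import product

design_factors = {
    "NG": [50, 100, 200],
    "GS": [10, 20, 50],
    "ICC": [0.091, 0.231, 0.333],
    "THR": ["50:50", "80:20"],
    "FL": [0.40, 0.55, 0.70],
}

# One dictionary per cell of the NG x GS x ICC x THR x FL design.
conditions = [
    dict(zip(design_factors, levels))
    for levels in product(*design_factors.values())
]

n_conditions = len(conditions)                 # 3 * 3 * 3 * 2 * 3
n_datasets = n_conditions * 500                # 500 replications per condition
```

With 500 replications per condition, the design corresponds to 81,000 generated datasets before any nonconverged solutions are excluded.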

Intentional misspecifications in the hypothesized models

After simulation data were generated, we analyzed simulation data in three different conditions: (a) correctly specified hypothesized model (i.e., the hypothesized model was equal to the population model as shown in Fig. 1, M_C); (b) misspecification in between-level model only (M_B, see Fig. 2); and (c) misspecification in within-level model only (M_W, see Fig. 2). Following Ryu and West’s (2009) and Hsu et al.’s (2017) studies, M_B contained a misspecification where only one between-level factor loaded on the indicators. M_W contained a misspecification where only one within-level factor loaded on the indicators. The WLSMV estimator was applied in three conditions to obtain the model solutions in Mplus. Starting values of parameter estimates were set to the same as the parameter values in the population model to prevent any convergence problems due to bad starting values. Fit indices of interest were computed in each condition.

Analysis

In the M_C, M_B, and M_W conditions, the values of fit indices were saved for subsequent analyses. Descriptive statistics for level-specific χ² test statistics and fit indices were computed and reported. If needed, factorial analysis of variance (ANOVA) was applied to determine the impact of the design factors on the performance of fit indices (Skrondal, 2000). Specifically, eta-squared (η²) was reported to indicate the proportion of the variance accounted for by a particular design factor or interaction term. Following Cohen’s (1988, 1992) suggestion, we used a moderate η² above .06 (i.e., practically significant) to identify design factors that influenced the fit index values. Note that when a fit index had a standard deviation close to 0, the impact of design factors on the values of the fit index was self-evidently trivial, regardless of the η² values. In addition, for level-specific χ² test statistics, we computed the type I error rates (under M_C) or statistical power (under M_B and M_W) at α = .05. For level-specific fit indices, we applied the traditional cut-off values (RMSEA-related fit indices < .06; CFI- and TLI-related fit indices > .95; SRMR-related fit indices < .08; Hu & Bentler, 1999) to explore whether these fit indices were promising for indicating correctly specified or misspecified hypothesized models. Note that our intention was not to encourage using traditional cut-off values as golden rules for model evaluation. Rather, we intended to provide a sense of whether these cut-off values are applicable when level-specific χ² test statistics or fit indices are used.
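The summary quantities described above can be sketched in plain Python. This is an illustration rather than the analysis code of the study: `eta_squared` computes the share of total variation explained by a single factor, which coincides with the main-effect η² of a factorial ANOVA only when the design is balanced, and all function names are hypothetical.

```python
from collections import defaultdict

# Traditional cut-offs (Hu & Bentler, 1999) by fit index family.
TRADITIONAL_CUTOFFS = {
    "RMSEA": ("<", 0.06),
    "CFI": (">", 0.95),
    "TLI": (">", 0.95),
    "SRMR": ("<", 0.08),
}


def rejection_rate(p_values, alpha=0.05):
    """Share of replications whose chi-square test rejects the model.

    Under a correct model this estimates the type I error rate;
    under a misspecified model it estimates statistical power.
    """
    return sum(p < alpha for p in p_values) / len(p_values)


def indicates_good_fit(family, value):
    """Compare a fit index value with the traditional cut-off for its family."""
    direction, bound = TRADITIONAL_CUTOFFS[family]
    return value < bound if direction == "<" else value > bound


def eta_squared(values, levels):
    """Proportion of total sum of squares explained by one design factor."""
    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    by_level = defaultdict(list)
    for value, level in zip(values, levels):
        by_level[level].append(value)
    ss_effect = sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2
        for group in by_level.values()
    )
    return ss_effect / ss_total
```

For example, `eta_squared([1, 2, 3, 4], ["a", "a", "b", "b"])` gives 0.8, which would count as an influential factor under the .06 threshold.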

Results

Convergence rates

The convergence rates were highly associated with the magnitudes of ICC and FL, followed by sample size (NG and GS). In contrast, THR was less influential. In this section, the convergence rates are reported by ICC and FL. First, the convergence rates were above 95% in conditions with high ICC (.333) and high FL (.70), with one exception (the convergence rate was 92% when NG = 50, GS = 10, and THR = 80:20). In conditions with high ICC and medium FL (.55), the convergence rates were close to or above 95% only when NG/GS were at least 100/20 or at least 200/10. On the other hand, in conditions with high ICC and low FL (.40), the convergence rates were below 95% (ranging between 46% and 89%). Second, in conditions with medium ICC (.231) and high FL, the convergence rates were above 95% when NG/GS were at least 100/10 and THR was symmetric. In contrast, in conditions with medium ICC and medium FL, the convergence rates were above 95% only when NG/GS were at least 200/50. Unfortunately, in conditions with medium ICC and low FL, the convergence rates were below 95% (ranging between 37% and 63%). Third, in conditions with low ICC (.091), the convergence rates were below 95%: they ranged from 48% to 84% when FL was high, from 40% to 52% when FL was medium, and from 29% to 47% when FL was low. Only converged solutions were used for further analysis.

Effects of design factors on level-specific fit indices

The left-hand side of Table 1 summarizes the descriptive statistics [aggregated means and standard deviations (SDs)] of level-specific χ² test statistics and fit indices for the M_C, M_B, and M_W conditions. The right-hand side of Table 1 shows the η² values derived from ANOVA results using level-specific χ² test statistics and fit indices as outcomes. We have highlighted η² values above .06 (Cohen, 1988, 1992) in grey to show practically significant effects of design factors. Note that when a level-specific χ² test statistic or fit index has a small variation (indicated by a small SD), the impact of design factors is self-evidently trivial even if an η² exceeding .06 is identified. Note also that the η² values of three-way interactions were close to 0 and, therefore, are not reported in Table 1 for the sake of simplicity. In addition, to inform the performance of level-specific χ² test statistics and fit indices, we report the average values of level-specific χ² test statistics and fit indices by all design factors in Appendix B (Tables 4–21), where type I error rates or statistical power are reported for level-specific χ² test statistics.

Table 1 ANOVA results (η²) with χ² test statistics and fit indices values as the dependent variables for the NG × GS × ICC × THR × FL design
