Introduction

Multilevel confirmatory factor analysis (MCFA), or multilevel measurement modeling, is an essential statistical approach for validating latent constructs that underlie multiple item responses collected in multilevel settings (e.g., students nested within schools). MCFA has been widely used in psychological and educational research, and relevant reporting guidelines have been made available for substantive researchers (Kim et al., 2016). Unlike in conventional confirmatory factor analysis, the construct of interest in MCFA can exist at multiple levels, which creates challenges for substantive researchers in model specification, evaluation, and interpretation. Because of this complexity, extensive efforts have been devoted to providing instructions for appropriate specification and interpretation of MCFA (Stapleton, McNeish et al., 2016a; Stapleton, Yang et al., 2016b).

Regarding model evaluation, a body of research has shown that traditional fit indices [e.g., root mean square error of approximation (RMSEA), comparative fit index (CFI), Tucker–Lewis Index (TLI), and standardized root mean square residual (SRMR)] are not sensitive to misspecified between-level models in multilevel structural equation modeling (MSEM) (Hsu et al., 2017; Padgett & Morgan, 2020). For this reason, researchers have advocated the use of level-specific (l-s) fit indices for evaluating the within-level model and the between-level model separately (Hox, 2010; Hsu et al., 2017; Rappaport et al., 2020; Ryu, 2014; Ryu & West, 2009; Schermelleh-Engel et al., 2014; Wu et al., 2017). According to Ryu and West (2009), the l-s fit indices could be straightforwardly computed by the partially saturated-model method (PS method). Note that the existing guidelines for using level-specific fit indices are based on population multilevel measurement models with continuous indicators, and few address ordered categorical variables (Padgett & Morgan, 2020). Yet it is unclear whether these guidelines could be applicable to models with dichotomous indicators.

To shed light on this issue, this study endeavors to examine the performance of level-specific χ2 test statistics and fit indices derived from the PS method in terms of their sensitivity to lack of fit at specific levels in MCFA with dichotomous indicators. In addition, the effectiveness of alternative level-specific fit indices obtained from Mplus—SRMR for the within-level model (SRMRw) and for the between-level model (SRMRB)—was compared with PS-level-specific fit indices.

MCFA with dichotomous indicators in multilevel structural equation modeling

For simplicity, we consider a two-level single-factor model. Let Ypig denote the pth dichotomous indicator (i.e., latent variable indicator) for individual i nested within group g (p = 1, …, P dichotomous indicators, i = 1, …, N individuals, and g = 1, …, G groups),

$${Y}_{pig}=\left\{\begin{array}{ll}1, & \mathrm{if}\ {y}_{pig}^{\ast }>\tau \\ {}0, & \mathrm{if}\ {y}_{pig}^{\ast}\le \tau \end{array}\right.$$
(1)

The equation expresses a threshold model, which assumes that underlying the dichotomous indicator Ypig is a normally distributed continuous latent variable \({y}_{pig}^{\ast }\), whose position relative to the threshold (τ) determines the category of the dichotomous indicator (Asparouhov & Muthen, 2007; Bollen, 2002). That is, the indicators of interest are conceptualized as continuous, but the response format of each indicator is a restricted, dichotomous scale (Bollen, 2002). For example, if the latent response of the ith individual does not exceed the threshold, the observed response of this individual is 0. If the latent response of the ith individual exceeds the threshold, the observed response is 1.
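The threshold rule in Eq. (1) can be illustrated with a short simulation. The following is a minimal sketch, assuming a standard normal latent response (in the multilevel model the latent variance also includes a between-level part, so this is a simplification); the function names, seed, and sample size are hypothetical and chosen only for illustration.

```python
import random
from statistics import NormalDist


def dichotomize(y_star, tau):
    """Apply Eq. (1): return 1 if the latent response exceeds tau, else 0."""
    return 1 if y_star > tau else 0


def threshold_for_split(prop_zero):
    """Threshold on a standard normal latent variable (an assumption) that
    yields the requested proportion of 0 responses, e.g. 0.80 for 80:20."""
    return NormalDist().inv_cdf(prop_zero)


rng = random.Random(2024)  # hypothetical seed, for reproducibility only
latent = [rng.gauss(0.0, 1.0) for _ in range(20000)]

tau_sym = threshold_for_split(0.50)   # symmetric 50:50 split
tau_asym = threshold_for_split(0.80)  # asymmetric 80:20 split

share_one_sym = sum(dichotomize(y, tau_sym) for y in latent) / len(latent)
share_one_asym = sum(dichotomize(y, tau_asym) for y in latent) / len(latent)

print(f"tau (50:50) = {tau_sym:.3f}, share of 1s = {share_one_sym:.3f}")
print(f"tau (80:20) = {tau_asym:.3f}, share of 1s = {share_one_asym:.3f}")
```

A threshold of 0 splits a standard normal latent variable in half, while moving the threshold upward makes the response 1 rarer, which is how an asymmetric response distribution can arise.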

Using similar notations to those used by Padgett and Morgan (2021), in this section, we outline a two-level measurement model with dichotomous indicators, as illustrated in Fig. 1. Note that the model in Fig. 1 was also adopted as a population model for simulated dataset generation in the current study. Using the between-and-within specification approach (Hox, 2010; Muthen, 1994), the covariance structure is partitioned into a within-level component (denoted by a W subscript) and a between-level component (denoted by a B subscript). Separate models are specified for each component. The within-level component captures individual-level variation, while the between-level component captures variation between groups.

Fig. 1

A two-level data generating model with two factors at each level

Let yig denote the p-dimensional response vector for student i in group g. The response vector yig is decomposed as:

$${y}_{ig}=\mu +{y}_{w_{ig}}+{y}_{B_g}$$
(2)

where μ represents the grand mean, and \({y}_{w_{ig}}\) and \({y}_{B_g}\) are independent within-level and between-level components, respectively. The measurement model at the within-level is given by the equation:

$${y}_{ig}={\mu}_g+{\varLambda}_W{\eta}_{W_{ig}}+{\epsilon}_{W_{ig}}$$
(3)

where ΛW is the p × 2 factor loading matrix for the within-level latent factors (\({\eta}_{W_{ig}}\)). The vector \({\eta}_{W_{ig}}\) is multivariate normally distributed with an expectation of zero and a 2 × 2 covariance matrix ΨW. The vector \({\epsilon}_{W_{ig}}\) is multivariate normally distributed with an expectation of zero and a p × p diagonal covariance matrix ΘW, with residual variances along the diagonal.

The measurement model at the between-level is given by

$${\mu}_g=\mu +{\varLambda}_B{\eta}_{B_g}+{\epsilon}_{B_g}$$
(4)

Here, ΛB, \({\eta}_{B_g}\), and \({\epsilon}_{B_g}\) are the between-level terms corresponding to the within-level terms ΛW, \({\eta}_{W_{ig}}\), and \({\epsilon}_{W_{ig}}\). Moreover, the covariance matrices ΨB and ΘB are the between-level counterparts to the within-level covariance matrices ΨW and ΘW. We can obtain Eq. (5) after combining Eqs. (3) and (4):

$${y}_{ig}=\mu +{\varLambda}_W{\eta}_{W_{ig}}+{\varLambda}_B{\eta}_{B_g}+{\epsilon}_{W_{ig}}+{\epsilon}_{B_g}$$
(5)
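Because the within-level and between-level components are assumed to be independent, Eq. (5) implies that the covariance matrix of the responses separates into two additive parts. The following worked step is a sketch under that assumption (and the usual assumption that factors and residuals are uncorrelated within each level); the symbols \({\varSigma}_W\) and \({\varSigma}_B\) are introduced here only for illustration.

$$\operatorname{Cov}\left({y}_{ig}\right)={\varSigma}_W+{\varSigma}_B,\quad {\varSigma}_W={\varLambda}_W{\varPsi}_W{\varLambda}_W^{\prime }+{\varTheta}_W,\quad {\varSigma}_B={\varLambda}_B{\varPsi}_B{\varLambda}_B^{\prime }+{\varTheta}_B$$

Each level therefore has its own model-implied covariance matrix, which is why a model can be specified and evaluated separately at the within level and the between level.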

Estimator

The diagonally weighted least squares (DWLS) estimator is recommended for single-level confirmatory factor analysis (CFA) with categorical indicators (DiStefano & Morgan, 2014; Forero et al., 2009) and for multilevel CFA because of its ability to identify the correct model specification (Padgett & Morgan, 2020). The DWLS estimator is based on polychoric correlations and uses the inverse of the asymptotic covariance matrix of the sample variances and covariances, \({W}^{-1}\), as a weight matrix. Because the estimation of \({W}^{-1}\) is quite unstable when the sample size is small, DWLS uses only the diagonal elements of W in model fitting and uses the full W to obtain standard errors and χ2 values (Jöreskog & Sörbom, 1996). As a result, DWLS produces robust standard errors and χ2 values (Flora & Curran, 2004; Muthen, 1993). Finney et al. (2006) suggested that when fewer than five categories are used, the DWLS estimator yields robust parameter estimates, standard errors, and fit indices for models with categorical data. In addition, the DWLS estimator has been found to perform well with small sample sizes and large models (Flora & Curran, 2004; Yang-Wallentin et al., 2010). Beauducel and Herzberg (2006) found that DWLS produced fit indices (root mean square error of approximation [RMSEA], comparative fit index [CFI], Tucker-Lewis Index [TLI]) that adequately indicated correctly specified models. However, when data were non-normally distributed in two to four categories, Bandalos (2008) found that robust DWLS-based RMSEA and CFI did not adequately identify misspecified models. Note that the DWLS estimator is known as the weighted least squares mean and variance adjusted (WLSMV) estimator in Mplus. In the present study, the DWLS estimator was adopted for analyzing simulated datasets by using the command “estimator = WLSMV” in Mplus.

Cut-off values for using fit indices on CFA with dichotomous indicators

Several simulation studies have examined whether Hu and Bentler’s (1999) conventional cut-off values for traditional fit indices (i.e., RMSEA < .06; CFI and TLI > .95; SRMR < .08) can be applied in the same way when DWLS is applied to ordered categorical data. In general, prior studies suggested that conventional cut-off values should be applied with careful consideration of the data’s characteristics, such as sample size, asymmetry of categorical data, and number of categories (DiStefano & Morgan, 2014; Garrido, Abad, & Ponsoda, 2016; Nye & Drasgow, 2011; Xia & Yang, 2019). To the best of our knowledge, Padgett and Morgan (2021) is the only study that provided recommended cut-off criteria for using traditional fit indices in MCFA with categorical indicators. Specifically, Padgett and Morgan found that CFI, TLI, and RMSEA were primarily influenced by within-level misspecification, but only partially influenced by between-level misspecification. In addition, the performance of the three fit indices was affected by the sample size at the between level (N2) when DWLS was used. Although they provided recommended cut-off criteria for CFI and TLI (> .98 if N2 < 100; > .97 if N2 ≥ 100) and RMSEA (< .02 regardless of N2), they also cautioned that those fit indices should be used only to provide weak evidence for some type of misspecification. Alternatively, SRMRw and SRMRB may provide within-level and between-level model-data fit, respectively. When DWLS is used, SRMRw needs a cut-off value of < .05 if N2 < 100, and a cut-off value of < .04 if N2 ≥ 100. On the other hand, SRMRB is suggested to have a cut-off value of < .06 if N2 ≥ 100 and should not be used if N2 < 100.
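The cut-off criteria summarized above depend on the number of between-level units, which is easy to misread in prose. The sketch below simply encodes the values quoted in this paragraph as a lookup; the function name and dictionary keys are hypothetical, and the code is a reading aid rather than an endorsement of fixed cut-offs.

```python
def categorical_mcfa_cutoffs(n2):
    """Return the DWLS cut-offs quoted in the text for a given N2.

    n2 is the between-level sample size (number of groups). A value of
    None means the text advises against using that index.
    """
    small = n2 < 100
    return {
        "CFI_min": 0.98 if small else 0.97,
        "TLI_min": 0.98 if small else 0.97,
        "RMSEA_max": 0.02,  # stated as regardless of N2
        "SRMRW_max": 0.05 if small else 0.04,
        "SRMRB_max": None if small else 0.06,
    }


for n2 in (50, 100, 200):
    print(n2, categorical_mcfa_cutoffs(n2))
```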

Model evaluation: PS-level-specific fit indices

Numerous simulation studies have recommended level-specific tests of exact fit (i.e., χ2 test statistics) and fit indices derived from the partially saturated-model method for detecting misspecification in MCFA (Hsu et al., 2017; Lee & Sohn, 2022; Rappaport et al., 2020; Ryu, 2011, 2014; Ryu & West, 2009; Schermelleh-Engel et al., 2014). Using the PS method, researchers can first derive the between-level-specific (b-l-s) χ2 test statistic (\({\chi}_{PS\_B}^2\)). Specifically, \({\chi}_{PS\_B}^2\) can be derived by specifying a hypothesized between-level model and saturating the within-level model (i.e., correlating all observed variables). A saturated within-level model can be seen as a just-identified model with zero degrees of freedom, and thus it contributes a χ2 test statistic equal to zero. Consequently, \({\chi}_{PS\_B}^2\) reflects only the model fit of the hypothesized between-level model. Since fit indices are a function of χ2 test statistics, researchers can then compute b-l-s fit indices (RMSEAPS_B, CFIPS_B, TLIPS_B) using the value of \({\chi}_{PS\_B}^2\).

In the same manner, the within-level-specific (w-l-s) χ2 test statistic (\({\chi}_{PS\_W}^2\)) can be derived by specifying a hypothesized within-level model and saturating the between-level model. The w-l-s fit indices (RMSEAPS_W, CFIPS_W, TLIPS_W) can be computed using the value of \({\chi}_{PS\_W}^2\). The formulas for computing l-s fit indices are identical to those in the studies by Ryu and West (2009) and Hsu, Lin, Skidmore, and Kim (2018) (see Appendix A). Additionally, the performance of the aforementioned l-s fit indices was compared with that of two alternative l-s fit indices, SRMRW and SRMRB, which are computed based on the discrepancy between the sample covariance and the corresponding model-implied covariance. Both SRMRW and SRMRB are available in the Mplus model solution output.
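To make concrete how fit indices follow from a χ2 value, the sketch below uses the common textbook forms of RMSEA, CFI, and TLI. It is an illustration under assumptions, not a reproduction of Appendix A: the sample size `n` used at a given level, the level-specific baseline (null) model that supplies `chi2_b` and `df_b`, and all input numbers are hypothetical.

```python
import math


def rmsea(chi2, df, n):
    """Textbook RMSEA; n is the sample size assumed for the level."""
    if df <= 0:
        return 0.0
    return math.sqrt(max(chi2 - df, 0.0) / (df * n))


def cfi(chi2_m, df_m, chi2_b, df_b):
    """Textbook CFI from the hypothesized (m) and baseline (b) models."""
    numerator = max(chi2_m - df_m, 0.0)
    denominator = max(chi2_m - df_m, chi2_b - df_b, 0.0)
    return 1.0 if denominator == 0 else 1.0 - numerator / denominator


def tli(chi2_m, df_m, chi2_b, df_b):
    """Textbook TLI; it is not bounded and can exceed 1."""
    ratio_b = chi2_b / df_b
    ratio_m = chi2_m / df_m
    return (ratio_b - ratio_m) / (ratio_b - 1.0)


# Hypothetical level-specific chi-square values, for illustration only.
chi2_m, df_m = 50.0, 34    # partially saturated model for one level
chi2_b, df_b = 400.0, 45   # assumed level-specific baseline model
n_level = 100              # assumed sample size for that level

print(f"RMSEA = {rmsea(chi2_m, df_m, n_level):.3f}")
print(f"CFI   = {cfi(chi2_m, df_m, chi2_b, df_b):.3f}")
print(f"TLI   = {tli(chi2_m, df_m, chi2_b, df_b):.3f}")
```

The same functions would be applied twice in practice, once with the χ2 from the model that saturates the within level and once with the χ2 from the model that saturates the between level.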

Previous simulation studies on l-s fit indices

Prior simulation studies have investigated the performance of l-s fit indices when the multivariate normality assumption was met in MCFA. Ryu and West (2009) examined the effectiveness of b-l-s fit indices (RMSEAPS_B, CFIPS_B) and w-l-s fit indices (RMSEAPS_W, CFIPS_W) using a population MCFA model with continuous indicators where multivariate normality was assumed. The intraclass correlation coefficient (ICC) level in the population model was fixed to 0.5. Ryu and West (2009) found that both types of fit indices can correctly indicate good and poor model fit in various sample size conditions (numbers of groups = 50, 100, 200, and 1000, and group size = 20, 50, 100). Ryu and West (2009), however, also found that a sample size of 50 groups could cause nonconvergence problems. Ryu and West’s findings were validated by Rappaport et al. (2020).

Hsu et al. (2017) extended the findings of Ryu and West (2009) by investigating the performance of l-s fit indices when ICC was smaller than 0.5. Hsu et al. (2017) found that the performance of w-l-s fit indices (RMSEAPS_W, CFIPS_W, TLIPS_W, and SRMRw) was barely influenced by ICC, while b-l-s fit indices (RMSEAPS_B, CFIPS_B, TLIPS_B, and SRMRB) were less-promising indicators to correctly indicate good or poor model fit. Similarly, Lee and Sohn (2022) found that w-l-s fit indices were sensitive to detecting misspecified within-level models and were less impacted by ICC and sample size. In addition, Lee and Sohn discovered that RMSEAPS_B and SRMRB were more promising for detecting misspecified between-level models with an increase in ICC, while CFIPS_B and TLIPS_B were also influenced by ICC, but the influence was moderated by the type of misspecifications in models (e.g., misspecification in factor cross-loadings or factor covariance). In summary, previous studies have considered various design factors, such as numbers of groups, group size, ICC, and type of misspecifications.

To the best of our knowledge, the performance of l-s fit indices in MCFA with categorical indicators has not been well examined in prior research. Hsu (2009) examined the sensitivity of SRMRW and SRMRB in MCFA with dichotomous indicators. Hsu (2009) found that SRMRW can correctly indicate good model fit (i.e., type I error rate < .05) and poor model fit (i.e., statistical power > .80) due to an intentional misspecification in factor covariance. The type I error rate of SRMRB, however, tended to be higher across different conditions, and the statistical power was less satisfying when ICC was low. Similar findings about SRMRW and SRMRB were revealed in Navruz’s (2016) study.

The present study

To date, no studies have attempted to evaluate the performance of l-s fit indices derived from the PS method in MCFA with categorical indicators. To address this concern, the present study attempted to evaluate the performance of commonly used l-s fit indices using a population model with dichotomous indicators. The findings of this study may shed some light on the practice of model evaluation when categorical indicators are used in MCFA.

The design factors considered in the present study included numbers of groups, group size, ICC, as well as two other factors, thresholds of dichotomous indicators and factor loadings, which have not been widely investigated in previous simulation studies focusing on the effectiveness of l-s fit indices. Asymmetric thresholds of dichotomous indicators can occur in real-world situations, and prior studies have documented the impact of asymmetric thresholds on the performance of several scaled or adjusted χ2 statistics (hereafter called adjusted χ2 statistics). For example, Rhemtulla, Brosseau-Liard, and Savalei (2012) examined three different threshold conditions (50:50, 60:40, and 80:20) and found that adjusted χ2 statistics had relatively low statistical power in detecting serious model misspecification with asymmetric thresholds and small samples. In addition, type I error rates were reasonable (smaller than .05) when the threshold was symmetric, but extremely asymmetric thresholds (80:20) could cause high type I error rates. Savalei and Rhemtulla (2013) conducted simulations based on a two-factor model with dichotomous indicators using a DWLS estimator. They examined the difference between symmetric thresholds (50:50) and asymmetric thresholds (64:36 and 85:15) on the adjusted χ2 statistics examined in Rhemtulla et al. (2012). Results suggested that the performance of adjusted χ2 statistics decreases as thresholds become more asymmetric. To the best of our knowledge, the impact of asymmetric thresholds on l-s fit indices has not been well documented. Because fit indices are a function of χ2 statistics, the performance of fit indices could very likely be influenced by the thresholds of dichotomous indicators. For this reason, we examined the impact of asymmetric thresholds on level-specific fit indices in MCFA with dichotomous indicators.

In addition, previous simulation studies (e.g., Forero et al., 2009; Garrido et al., 2016; Heene et al., 2011; Nestler, 2013) have consistently shown that CFAs with low factor loadings resulted in less adequate model evaluation results. Nestler (2013) examined the impact of three levels of factor loadings, high (0.70), medium (0.55), and low (0.40), on χ2 statistics with a two-factor CFA with dichotomous indicators. Nestler found that when factor loadings were high, the statistical power of χ2 test statistics was satisfactory (approximately > .80), regardless of sample size. However, when factor loadings were medium or low, a sample size of 250 or 500, respectively, was needed to retain satisfactory statistical power. Furthermore, previous simulation studies have shown that the magnitudes of factor loadings had an impact on the performance of traditional fit indices. For example, Heene et al. (2011) examined how the performance of the χ2 test statistic, RMSEA, CFI, and SRMR could be influenced by three levels of factor loadings [low ~(0.30, 0.50); medium ~(0.50, 0.70); high ~(0.70, 0.90)]. Heene et al. (2011) found that decreasing factor loadings led to decreasing values of the χ2 test statistic and the three fit indices, altering the statistical power to detect misspecified models. In general, low factor loadings resulted in decreasing values of the χ2 test statistic, RMSEA, CFI, and SRMR. As a result, misspecification was often not detected by the χ2 test statistic, RMSEA, or SRMR when factor loadings were low. However, high factor loadings can cause the rejection of only slightly misspecified models (cf. Savalei, 2012). In contrast, Heene et al. (2011) found that the CFI tended to indicate poorer fit for models with low factor loadings. The reason is that models with lower factor loadings have lower covariances between the observed variables. As a result, the distance between the hypothesized model and the baseline null model is reduced, resulting in lower CFI values (Garrido et al., 2016). The impact of low factor loadings on CFI can be extended to TLI because both fit indices are a function of the distance between the hypothesized model and the baseline null model. To date, few efforts have been made to study the impact of factor loadings on l-s fit indices in MCFA with dichotomous indicators. Our study aimed to bridge this knowledge gap.

Note that although we were aware that type of misspecification (MT) can be manipulated as a design factor, our study only considered one MT condition, which was the misspecification in factor covariance. Our selected misspecification scenario is important for applied researchers who wish to verify the construct validity of their instruments. We did not include the scenario of misspecification in cross-loadings as another MT condition because we think misspecification in cross-loadings is a highly complex topic (e.g., over-specification, under-specification, magnitudes of factor cross-loadings) which deserves a new study to comprehensively investigate this issue.

Method

A Monte Carlo study was conducted to evaluate the performance of l-s fit indices in MCFA with dichotomous indicators. The five design factors examined in this study were numbers of groups, group size, intraclass correlation coefficient, thresholds of dichotomous indicators, and factor loadings.

Population model

In the current study, the population model (see Fig. 1) for simulation data generation was based on the population model presented in Hsu et al.’s (2017) study. The population model was a two-level measurement model with two within-level factors (ηW1 and ηW2) and two between-level factors (ηB1 and ηB2). At the within level, five dichotomous observed indicators loaded on each factor. Parameters in the within-level model for simulation data generation were factor loadings = 0.70, factor variances = 1.00, and factor covariance = 0.30. The residual variance parameters cannot be freely estimated; therefore, no initial residual variances were set (Muthen, 1990). Factors and residual variances were independent of each other. Note that the correlation between the two within-level factors was also 0.30 because within-level factor variances were fixed at 1.00. The threshold of the indicators was equal to 0, resulting in a 50:50 proportion of responses that are 0 or 1.

The between-level model had an identical factorial structure to the within-level model. Parameters for simulation data generation were factor loadings = 0.70, residual variances = 0.51. Factors and residuals were independent of each other. We varied the variance of between-level factors (a in Fig. 1) to create three ICC conditions. (More detailed information is presented in the next section.) Note that for each ICC condition, we adjusted the between-level factor covariance (b in Fig. 1) based on the formula: 0.30 \(\times \sqrt{\mathit{\operatorname{var}}\left({\eta}_{B1}\right)}\times \sqrt{\mathit{\operatorname{var}}\left({\eta}_{B2}\right)}\) so that the correlation of two between-level factors can be held to 0.30 across different ICC conditions.

Design factors

Numbers of groups (NG)

An NG larger than 100 has been recommended for acceptable estimation at the between level when ICC is low (Hox & Maas, 2001; Hsu et al., 2015). Ryu and West (2009) used NG = 50, 100, 200, and 1000 in their simulation work, where NG = 50 could lead to nonconvergence problems and NG = 1000 was not a realistic NG for practitioners. Hsu et al. (2017) found that NG = 200 seemed to compensate for convergence problems if ICC was low. As a result, this study considered NG ranging from 50 to 200. Specifically, the current study adopted three NG conditions (50, 100, and 200) to evaluate the impact of NG on the performance of level-specific fit indices when the indicators of MCFA were dichotomous.

Group size (GS)

Hox and Maas (2001) used GS = 10, 20, and 50 in an MSEM study and found that GS had a trivial impact on parameter estimates, as well as on standard errors. Additionally, Ryu and West (2009) used GS = 20, 50, and 100 in their study focusing on the performance of fit indices in MSEM. In consideration of common practices and comparability, the present study adopted GS = 10, 20, and 50.

Intraclass correlation coefficient (ICC)

The ICC (ρ) is defined as:

$$\rho =\frac{VAR_{Between}}{VAR_{Between}+{VAR}_{Within}}$$
(6)

where VARBetween and VARWithin are the variance of between-level factors and within-level factors, respectively. The variance of within-level factors was constrained to 1.00, while that of between-level factors varied to create different ICC conditions. The present study considered three levels of between-level variance (0.1, 0.3, and 0.5), resulting in three levels of ICC: .091 (ICC1), .231 (ICC2), and .333 (ICC3).
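The three ICC levels follow directly from Eq. (6). The short sketch below reproduces that arithmetic and also evaluates the between-level factor covariance implied by a correlation of 0.30; it assumes, for illustration, that both between-level factors share the same variance in each condition, and the function names are hypothetical.

```python
import math


def icc(var_between, var_within=1.0):
    """Eq. (6): between-level variance over total variance."""
    return var_between / (var_between + var_within)


def between_factor_covariance(var_b1, var_b2, correlation=0.30):
    """Covariance that keeps the between-level factor correlation fixed."""
    return correlation * math.sqrt(var_b1) * math.sqrt(var_b2)


for var_b in (0.1, 0.3, 0.5):
    rho = icc(var_b)
    # Assumption: both between-level factors have variance var_b.
    cov_b = between_factor_covariance(var_b, var_b)
    print(f"var_B = {var_b:.1f}  ICC = {rho:.3f}  cov_B = {cov_b:.3f}")
```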

Thresholds of dichotomous indicators (THR)

To the best of our knowledge, study of the impact of asymmetric thresholds on l-s fit indices is still lacking. Guided by Rhemtulla et al.’s (2012) and Savalei and Rhemtulla’s (2013) studies, we considered both symmetric (50:50) and asymmetric (80:20) conditions. By considering these design factors, we intended to understand the impact of asymmetry of categorical data on the performance of l-s χ2 and fit indices.

Factor loadings (FL)

Following Nestler’s (2013) simulation study, we adopted three conditions of FL in this study: low (.40), medium (.55), and high (.70). The setting of FL was also in line with previous simulation studies (Forero et al., 2009; Garrido et al., 2016; Heene et al., 2011). Previous simulation studies have shown that low magnitudes of factor loadings resulted in less sensitivity of traditional fit indices. By including this design factor, we intended to examine the extent to which the magnitudes of factor loadings can impact the performance of l-s χ2 and fit indices.

As a result, a total of 162 conditions (NG = 50, 100, and 200; GS = 10, 20, and 50; ICC = ICC1 to ICC3; THR = 50:50 and 80:20; FL = .40, .55, and .70) were yielded. For each condition, 500 replications were generated using Mplus 7.4 (Muthen & Muthen, 1998–2015).
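The count of 162 conditions is the product of the numbers of levels of the five design factors (3 × 3 × 3 × 2 × 3). A minimal sketch of enumerating the fully crossed design is given below; the dictionary layout and variable names are hypothetical and are not taken from the study's Mplus setup.

```python
from itertools import product

# Levels of the five design factors as listed in the text.
design = {
    "NG": (50, 100, 200),
    "GS": (10, 20, 50),
    "ICC": ("ICC1", "ICC2", "ICC3"),
    "THR": ("50:50", "80:20"),
    "FL": (0.40, 0.55, 0.70),
}

# Fully crossed design: one dictionary per simulation condition.
conditions = [dict(zip(design, levels)) for levels in product(*design.values())]

replications_per_condition = 500
total_datasets = len(conditions) * replications_per_condition

print(len(conditions))   # number of conditions
print(total_datasets)    # number of generated datasets
print(conditions[0])     # first condition in the grid
```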

Intentional misspecifications in the hypothesized models

After simulation data were generated, we analyzed simulation data in three different conditions: (a) correctly specified hypothesized model (i.e., the hypothesized model was equal to the population model as shown in Fig. 1, MC); (b) misspecification in between-level model only (MB, see Fig. 2); and (c) misspecification in within-level model only (MW, see Fig. 2). Following Ryu and West’s (2009) and Hsu et al.’s (2017) studies, MB contained a misspecification where only one between-level factor loaded on the indicators. MW contained a misspecification where only one within-level factor loaded on the indicators. The WLSMV estimator was applied in three conditions to obtain the model solutions in Mplus. Starting values of parameter estimates were set to the same as the parameter values in the population model to prevent any convergence problems due to bad starting values. Fit indices of interest were computed in each condition.

Fig. 2

a Misspecification in the between-level model only (MB). b Misspecification in the within-level model only (MW)

Analysis

In the MC, MB, and MW conditions, the values of fit indices were saved for subsequent analyses. Descriptive statistics for level-specific χ2 test statistics and fit indices were computed and reported. If needed, factorial analysis of variance (ANOVA) was applied to determine the impact of the design factors on the performance of fit indices (Skrondal, 2000). Specifically, eta-squared (η2) was reported to indicate the proportion of the variance accounted for by a particular design factor or interaction term. Following Cohen’s (1988, 1992) suggestion, we used a moderate η2 above .06 (i.e., practically significant) to identify design factors that were influential for the fit index values. Note that when a fit index had a standard deviation close to 0, the impact of design factors on the values of the fit index was self-evidently trivial, regardless of the η2 values. In addition, for level-specific χ2 test statistics, we computed the type I error rates (under MC) or statistical power (under MB and MW) at an α level of .05. For level-specific fit indices, we applied the traditional cut-off values (RMSEA-related fit indices < .06; CFI- and TLI-related fit indices > .95; SRMR-related fit indices < .08; Hu & Bentler, 1999) to explore whether these fit indices were promising for indicating correctly specified or misspecified hypothesized models. Note that our intention was not to encourage using traditional cut-off values as golden rules for model evaluation. Rather, we intended to provide a sense of whether these cut-off values are applicable when level-specific χ2 test statistics or fit indices are used.
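The two summary quantities used in this analysis can be sketched in a few lines. The example below computes a rejection rate (type I error rate under MC, power under MB or MW) from replication p values, and a classical η2 for a single design factor as the ratio of the effect sum of squares to the total sum of squares. It is a simplified illustration: the study used factorial ANOVA, and the input numbers and function names here are hypothetical.

```python
def rejection_rate(p_values, alpha=0.05):
    """Share of replications in which the chi-square test rejects the model."""
    return sum(p < alpha for p in p_values) / len(p_values)


def eta_squared(groups):
    """Classical eta-squared for one factor: SS_effect / SS_total.

    groups maps each level of a design factor to a list of fit-index values.
    """
    all_values = [v for values in groups.values() for v in values]
    grand_mean = sum(all_values) / len(all_values)
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    ss_effect = sum(
        len(values) * (sum(values) / len(values) - grand_mean) ** 2
        for values in groups.values()
    )
    return ss_effect / ss_total if ss_total > 0 else 0.0


# Hypothetical replication results, for illustration only.
p_values = [0.01, 0.20, 0.04, 0.50]
rmsea_by_fl = {"low": [0.01, 0.02, 0.03], "high": [0.05, 0.06, 0.07]}

print(rejection_rate(p_values))
print(round(eta_squared(rmsea_by_fl), 3))
```

An η2 above .06 in such a calculation would mark the factor as practically significant under the criterion described above.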

Results

Convergence rates

The convergence rates were highly associated with the magnitude of ICC and FL, followed by sample size (NG and GS). In contrast, THR was less influential. In this section, the convergence rates are reported by ICC and FL. First, the convergence rates were above 95% in conditions with high ICC (.333) and high FL (.70), with one exception (the convergence rate was 92% when NG = 50, GS = 10, and THR = 80:20). In conditions with high ICC and medium FL (.55), the convergence rates were close to or above 95% only when NG/GS were at least 100/20 or at least 200/10. On the other hand, in conditions with high ICC and low FL (.40), the convergence rates were below 95% (ranging between 46% and 89%). Second, in conditions with medium ICC (.231) and high FL, the convergence rates were above 95% when NG/GS were at least 100/10 and THR was symmetric. In contrast, in conditions with medium ICC and medium FL, the convergence rates were above 95% only when NG/GS were at least 200/50. In conditions with medium ICC and low FL, the convergence rates were below 95% (ranging between 37% and 63%). Third, in conditions with low ICC (.091), the convergence rates were below 95%: they ranged between 48% and 84% when FL was high, between 40% and 52% when FL was medium, and between 29% and 47% when FL was low. Only converged solutions were used for further analysis.

Effects of design factors on level-specific fit indices

The left-hand side of Table 1 summarizes the descriptive statistics [aggregated means and standard deviations (SDs)] of level-specific χ2 test statistics and fit indices for the MC, MB, and MW conditions. The right-hand side of Table 1 shows the η2 values derived from ANOVA results using level-specific χ2 test statistics and fit indices as outcomes. We have highlighted η2 values above .06 (Cohen, 1988, 1992) in grey to show practically significant effects of design factors. Note that when a level-specific χ2 test statistic or fit index has a small variation (indicated by a small SD), the impact of design factors is self-evidently trivial even if an η2 exceeding .06 is identified. Note that the η2 values of three-way interactions were close to 0 and, therefore, are not reported in Table 1 for the sake of simplicity. In addition, to inform the performance of level-specific χ2 test statistics and fit indices, we report the average values of level-specific χ2 test statistics and fit indices by all design factors in Appendix B (Tables 4 to 21), where type I error rates or statistical power are reported for level-specific χ2 test statistics.

Table 1 ANOVA results (η2) with χ2 test statistics and fit indices values as the dependent variables for the NG * GS * ICC * THR * FL design

In the following sections, we report the simulation results for between-level-specific and within-level-specific χ2 and fit indices separately. To better inform interested readers of our findings, we created Tables 2 and 3 to summarize the simulation results for between-level-specific and within-level-specific χ2 and fit indices, respectively. In addition, our simulation results suggest that the performance of between-level-specific χ2 and fit indices was complicated and more likely influenced by NG, ICC, and FL. We visualized the values of between-level-specific fit indices in Figs. 3 and 4 (by NG, ICC, and FL, with GS set to 50) for the MC and MB conditions, respectively. Also, because we found that THR could slightly affect the performance of between-level-specific fit indices in some conditions, the visualization is presented for symmetric THR and asymmetric THR conditions separately. Note that we did not present a visualization of within-level-specific fit indices because the patterns of their performance were less complicated and could be clearly described in Table 3.

Table 2 Summary of simulation findings for between-level-specific χ2 and fit indices
Table 3 Summary of simulation findings for within-level-specific χ2 and fit indices
Fig. 3

Values of between-level-specific fit indices under MC. Note. MC = correctly specified hypothesized model. NG = numbers of groups. GS = group size. ICC = intraclass correlation coefficient. THR = threshold of dichotomous indicators (symmetric = 50:50 and asymmetric = 80:20). FL = factor loadings.

Fig. 4

Values of between-level-specific fit indices under MB. Note. MB = misspecification in between-level model only. NG = numbers of groups. GS = group size. ICC = intraclass correlation coefficient. THR = threshold of dichotomous indicators (symmetric = 50:50 and asymmetric = 80:20). FL = factor loadings

Between-level-specific χ2 and fit indices

\({\boldsymbol{\chi}}_{\boldsymbol{B}}^{\textbf{2}}\). As shown in Table 1, under MC, where the hypothesized between-level model was correctly specified, \({\chi}_B^2\) had a mean of 31.53, an SD of 6.45, and η2s ranging from .00 to .03. The small η2s suggest that the five design factors (NG, GS, ICC, THR, and FL) accounted for a trivial share of the variance of \({\chi}_B^2\). That is, our design factors had a negligible effect on \({\chi}_B^2\). Appendix B shows that the type I error rates of \({\chi}_B^2\) were generally acceptable (close to or below .05) across all conditions. On the other hand, under MB, where the hypothesized between-level model was misspecified, \({\chi}_B^2\) had a mean of 45.22 and an SD of 23.77 (see Table 1). FL (η2 = .10), ICC (η2 = .09), NG (η2 = .06), and ICC*FL (η2 = .07) jointly affected \({\chi}_B^2\). Based on the data in Appendix B, we found that \({\chi}_B^2\) had higher statistical power to detect misspecified between-level models when FL, ICC, or NG increased. These results show that when \({\chi}_B^2\) was used, type I error rates might not be a concern, but statistical power would be a problem when FL, ICC, or NG was low.
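The η2 values reported throughout this section express the share of variance in a statistic that is attributable to a design factor. The following is a minimal sketch of that idea for a single factor, assuming a one-way decomposition (between-level sum of squares divided by total sum of squares). The function name and the toy data are hypothetical, and the η2s in Table 1 may well come from a factorial analysis that also partitions out interactions, so this is an illustration rather than the computation used in the study.

```python
from statistics import fmean

def eta_squared(groups):
    """One-way eta squared: SS_between / SS_total.

    `groups` maps each level of one design factor (e.g., NG = 50, 100, 200)
    to the list of values of a fit statistic observed at that level.
    """
    all_values = [v for values in groups.values() for v in values]
    grand_mean = fmean(all_values)
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    ss_between = sum(
        len(values) * (fmean(values) - grand_mean) ** 2
        for values in groups.values()
    )
    # With no variation at all, report 0 rather than dividing by zero.
    return ss_between / ss_total if ss_total > 0 else 0.0

# Hypothetical toy data, not values from the study.
toy = {"low": [1.0, 2.0, 3.0], "high": [3.0, 4.0, 5.0]}
print(round(eta_squared(toy), 3))
```

In the toy data the total sum of squares is 10 and the between-level sum of squares is 6, so the factor accounts for 60% of the variance.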

We further explored the statistical power of \({\chi}_B^2\) reported in Appendix B. First, we found that statistical power was far below .80 under several conditions: NG = 50, GS = 10, ICC1, or low FL. Second, we found that \({\chi}_B^2\) was able to reach satisfactory statistical power (close to or above .80) under ICC3 with (a) high FL and NG = 100, or (b) medium FL and NG = 200. If ICC was medium (ICC2), only the following condition resulted in satisfactory statistical power: high FL, NG/GS of at least 200/20, and symmetric THR. The findings regarding the performance of \({\chi}_B^2\) are summarized in Table 2.
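Type I error rates and statistical power in a simulation of this kind are both rejection rates: the proportion of replications in which the χ2 test rejects the hypothesized model, computed under a correctly specified model (type I error) or a misspecified model (power). A minimal sketch follows, assuming that a p value is available for each converged replication; the function name and the example p values are hypothetical.

```python
def rejection_rate(p_values, alpha=0.05):
    """Proportion of replications whose chi-square test rejects at `alpha`.

    Under a correctly specified model this estimates the type I error rate;
    under a misspecified model it estimates statistical power.
    """
    if not p_values:
        raise ValueError("need at least one replication")
    return sum(p < alpha for p in p_values) / len(p_values)

# Hypothetical p values from four replications of one condition.
example = [0.01, 0.20, 0.03, 0.60]
print(rejection_rate(example))
```

A condition would then be judged adequately powered when this rate is close to or above .80 under misspecification, which is the criterion used in this section.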

RMSEAB

Table 1 shows that under MC, RMSEAB had a close-to-zero mean (0.01) and SD (0.02), suggesting that RMSEAB had the desired characteristic of indicating good model fit across replications when the between-level model was correctly specified. The η2s of RMSEAB were below .02, indicating that the design factors had a trivial impact on RMSEAB. Figure 3 and Appendix B show that RMSEAB had means below the traditional cut-off value of .06 (Hu & Bentler, 1999) in all conditions. In contrast, under MB, as shown in Table 1, RMSEAB had a mean of 0.03 and an SD of 0.04. ICC (η2 = .14) and FL (η2 = .14) were the dominant sources of variance in RMSEAB, followed by ICC*FL (η2 = .08). According to Fig. 4 and Appendix B, RMSEAB was more likely to identify misspecified hypothesized between-level models when ICC or FL increased. In addition, the impact of ICC on RMSEAB was small when FL was low and became greater as FL increased. These results show that low magnitudes of ICC or FL in models might interfere with the capacity of RMSEAB to correctly determine model fit.

We examined the tables in Appendix B to determine the extent to which RMSEAB could be used to determine model fit when the cut-off value of .06 was applied. We found that RMSEAB could not be used to determine model fit under ICC1 or low FL conditions. On the other hand, RMSEAB could be widely used in conditions with high ICC (ICC3) and high FL. In conditions with ICC3 and medium FL, increasing NG to at least 200 could prevent RMSEAB from under-rejecting poor model fit. The performance of RMSEAB is summarized in Table 2.

CFIB and TLIB

Under MC, CFIB and TLIB were expected to be above the traditional cut-off value of .95 (Hu & Bentler, 1999). However, as reported in Table 1, CFIB and TLIB had means (0.81 and 0.79, respectively) far below .95 and the SDs (0.35 and 0.36, respectively) were not negligible. This result suggested that CFIB and TLIB were not necessarily promising to indicate correctly hypothesized between-level models across replications. Their η2s provided more information about when these two fit indices could be useful. Results show our design factors influenced CFIB and TLIB in a similar way. Particularly, NG (η2 = .08 and .07 for CFIB and TLIB, respectively), ICC (η2 = .09 for CFIB and TLIB), and FL (η2 = .10 and .09 for CFIB and TLIB, respectively) had a similar magnitude of impacts on CFIB and TLIB. Based on Appendix B and Fig. 3, we found that both fit indices were more promising to identify correctly hypothesized between-level models when FL, ICC, or NG increased.

Under MB, CFIB and TLIB also behaved similarly. As presented in Table 1, CFIB and TLIB had means (0.65 and 0.59, respectively) far below the traditional cut-off value of .95, which is the desired outcome for a misspecified model, with SDs equal to 0.34 and 0.36, respectively. The η2s ranged from .00 to .04, suggesting that the design factors had minimal effects on CFIB and TLIB. Taken together, the results derived from MC and MB conditions show that low magnitudes of NG, ICC, or FL in models might limit the performance of CFIB and TLIB in correctly determining model fit.

We then examined tables in Appendix B to understand the performance of CFIB and TLIB under different conditions when the cut-off value of .95 was applied. First, we found that CFIB and TLIB cannot be used to determine model fit under ICC1 or low FL conditions. Second, if either ICC or FL were medium, increasing NG to at least 200 could prevent CFIB and TLIB from over-rejecting good model fit. Third, both indices could be widely used in conditions with ICC3, high FL, and NG at least 100. We summarized the performance of CFIB and TLIB in detail in Table 2.

SRMRB

Under MC, SRMRB had a mean of 0.09, which was greater than the traditional cut-off value of .08 (Hu & Bentler, 1999), and an SD of 0.03 (see Table 1), suggesting that SRMRB might not be able to correctly indicate good model fit in all conditions. NG (η2 = .56) accounted for a substantial proportion of the variance of SRMRB, followed by GS (η2 = .09). Figure 3 and Appendix B show that SRMRB was more promising for identifying correctly hypothesized between-level models when NG or GS increased.

On the other hand, under MB, SRMRB had a mean of 0.10 and an SD of 0.03 (see Table 1), and it was affected by NG (η2 = .42), FL (η2 = .08), and GS (η2 = .07). Based on Appendix B, we found that SRMRB was more promising when FL increased. However, we also found that SRMRB was less promising for correctly indicating poor models when NG or GS increased. After a close look at Figs. 3 and 4 as well as the tables in Appendix B, we found this result unsurprising: the patterns and values of SRMRB found under MC were reproduced under MB in many conditions, including low NG, low FL, ICC1, or ICC2 with medium FL [note that ICC and ICC*FL each accounted for little of the variance of SRMRB (η2 = .03 each; see Table 1)]. The performance of SRMRB under MB should therefore be interpreted cautiously. Put together, one should be aware that a low magnitude of NG, FL, or ICC might cause SRMRB to lose the capacity to correctly determine model fit.

Using the tables within Appendix B, we found some conditions determining whether SRMRB could be useful when the cut-off value of .08 was applied. First, as aforementioned, we found SRMRB cannot be used to determine model fit under (a) low NG, (b) low FL, (c) ICC1, or (d) ICC2 with medium FL conditions. Second, in conditions with (a) ICC3 and high/medium FL or (b) ICC2 with high FL, SRMRB was useful if NG/GS were at least 100/50 or NG/GS were at least 200/10. When asymmetric THR appeared, NG/GS of at least 200/10 could be considered. We summarized the performance of SRMRB in Table 2.
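For readers less familiar with SRMR, the index summarizes the standardized residuals between an observed and a model-implied covariance matrix. The sketch below uses one common definition (residuals divided by the observed standard deviations, averaged over the unique elements); software packages differ in the details, and applying it to the between-level matrices to obtain SRMRB is an assumption made for illustration. The function name and the 2 × 2 matrices are hypothetical.

```python
from math import sqrt

def srmr(observed, implied):
    """Standardized root mean square residual for two covariance matrices.

    Both arguments are square nested lists of equal size. Residuals are
    standardized by the observed variances and averaged over the
    p * (p + 1) / 2 unique (lower-triangular) elements.
    """
    p = len(observed)
    total = 0.0
    for i in range(p):
        for j in range(i + 1):
            scale = sqrt(observed[i][i] * observed[j][j])
            total += ((observed[i][j] - implied[i][j]) / scale) ** 2
    return sqrt(total / (p * (p + 1) / 2))

# Hypothetical matrices: the model underestimates one covariance by 0.2.
observed = [[1.0, 0.5], [0.5, 1.0]]
implied = [[1.0, 0.3], [0.3, 1.0]]
print(round(srmr(observed, implied), 3))
```

Because the index depends only on the size of the residuals, small between-level covariances leave little room for large residuals, which is in line with the weak performance of SRMRB in the low-ICC and low-FL conditions described above.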

Within-level-specific χ2 and fit indices

\({\boldsymbol{\chi}}_{\boldsymbol{W}}^{\textbf{2}}\). As presented in Table 1, under MC, \({\chi}_W^2\) had a mean of 31.84, an SD of 7.14, and η2s ranging from .00 to .09. GS was the only influential design factor (η2 = .09); however, Appendix B shows that the type I error rates of \({\chi}_W^2\) were close to or below .05 across all conditions. That is, GS did not meaningfully affect the capacity of \({\chi}_W^2\) to identify correctly hypothesized within-level models. In contrast, under MW, \({\chi}_W^2\) had a mean of 306.75 and an SD of 377.37, as shown in Table 1. The large variation in values of \({\chi}_W^2\) resulted from FL (η2 = .14), NG (η2 = .11), GS (η2 = .08), NG*FL (η2 = .06), and GS*FL (η2 = .06). Based on Appendix B, we found that \({\chi}_W^2\) had higher statistical power (close to or above .80) to detect misspecified within-level models when NG, GS, and FL increased. These results show that, when using \({\chi}_W^2\), type I error rates might not be a problem, but low statistical power might appear when NG, GS, or FL was low.

We examined Appendix B to uncover the conditions where \({\chi}_W^2\) had statistical power close to or above .80. First, when FL was high, \({\chi}_W^2\) was able to reach satisfactory statistical power given our minimum combination of NG/GS (50/10). Second, when FL was medium, the thresholds of dichotomous indicators started to slightly affect the required NG and GS (THR had an η2 of .03, as shown in Table 1): NG/GS = 50/10 was sufficient for symmetric THR, while NG/GS of at least 50/20 or at least 100/10 was required for asymmetric THR. Third, when FL was low, THR mattered a little more: NG/GS of at least 100/20 or 200/10 was sufficient for symmetric THR, and NG/GS of at least 100/50 or 200/20 was required for asymmetric THR. The performance of \({\chi}_W^2\) is summarized in Table 3.
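The sample-size conditions just described can be written as a small decision rule. The sketch below is only a restatement of the preceding paragraph, not a general power formula: the function name and the string labels are hypothetical, reading "NG/GS of at least a/b" as NG ≥ a and GS ≥ b is an assumption, and the rule should not be extrapolated beyond the design cells examined in this study.

```python
def chi2_w_power_adequate(fl, thr, ng, gs):
    """Return True if the text above reports power near or above .80.

    fl  : "high", "medium", or "low" factor loadings
    thr : "symmetric" or "asymmetric" thresholds
    ng  : number of groups; gs : group size
    """
    if thr not in ("symmetric", "asymmetric"):
        raise ValueError(f"unknown THR label: {thr!r}")

    def at_least(min_ng, min_gs):
        return ng >= min_ng and gs >= min_gs

    if fl == "high":
        return at_least(50, 10)
    if fl == "medium":
        if thr == "symmetric":
            return at_least(50, 10)
        return at_least(50, 20) or at_least(100, 10)
    if fl == "low":
        if thr == "symmetric":
            return at_least(100, 20) or at_least(200, 10)
        return at_least(100, 50) or at_least(200, 20)
    raise ValueError(f"unknown FL label: {fl!r}")

print(chi2_w_power_adequate("low", "asymmetric", ng=100, gs=20))
```

For example, with low FL and asymmetric THR, 100 groups of 20 fall short of both qualifying combinations (100/50 and 200/20), so the rule returns False.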

RMSEAW

Table 1 shows that under MC, RMSEAW had a mean equal to 0.00 and an SD equal to 0.01, suggesting that RMSEAW shows a desired property of indicating good model fit across replications when the within-level model was correctly specified. GS (η2 = .10) was the only factor that significantly influenced RMSEAW. Nevertheless, the impact of GS on RMSEAW could be ignored because the variation of RMSEAW was close to zero; thus the impact of GS was self-evidently trivial. RMSEAW had means below a traditional cut-off value of .06 (Hu & Bentler, 1999) in all conditions.

On the other hand, under MW, as shown in Table 1, RMSEAW had a mean of 0.04 and an SD of 0.02. FL (η2 = .56) dominantly affected RMSEAW, followed by THR (η2 = .09). The tables in Appendix B show that RMSEAW was more likely to identify misspecified hypothesized within-level models when FL increased or THR was symmetric. After examining the tables in Appendix B, we found that RMSEAW tended to under-reject poor model fit when the cut-off value of .06 was applied. In particular, the results show that RMSEAW could be used to determine model fit only under very restricted conditions: when FL was high and THR was symmetric. The performance of RMSEAW is summarized in Table 3.

CFIW and TLIW

In general, CFIW and TLIW performed similarly. As presented in Table 1, under MC, both CFIW and TLIW had a mean equal to 0.99 and an SD equal to 0.03, suggesting that these two fit indices were able to correctly indicate good model fit across replications. The η2 of GS was .07 for CFIW and .08 for TLIW; however, the impact of GS could be neglected due to the small variation in CFIW and TLIW. Under MW, CFIW and TLIW had means of 0.74 and 0.66, and SDs of 0.08 and 0.10, respectively. Although FL seemed to affect CFIW and TLIW (η2 = .09), Appendix B reveals that both CFIW and TLIW had means far below the traditional cut-off value of .95 (Hu & Bentler, 1999) in most conditions, suggesting that these two fit indices were promising for identifying incorrectly hypothesized within-level models.

We examined Appendix B to understand the sensitivity of CFIW and TLIW under different conditions when the cut-off value of .95 was applied. We found that CFIW and TLIW can be widely used with the cut-off value of .95 except for the conditions with low FL, NG/GS = 50/10, and asymmetric THR. We summarized the performance of CFIW and TLIW in Table 3.

SRMRW

Under MC, SRMRW had a mean of 0.04, which was below the traditional cut-off value of .08 (Hu & Bentler, 1999), and a small SD of 0.02 (shown in Table 1). The results suggest that SRMRW showed promise for correctly indicating good model fit in most replications. The effects of NG (η2 = .29), GS (η2 = .40), and THR (η2 = .08) were practically significant but could be ignored due to the small variation in SRMRW. On the other hand, under MW, SRMRW had a mean of 0.09, which was above the traditional cut-off value of .08, and a small SD of 0.02, suggesting that SRMRW was useful for indicating poor model fit in most replications. The influence of FL (η2 = .54) and GS (η2 = .06) could be ignored since the variation in SRMRW was trivial. Using Appendix B, we found that SRMRW with a cut-off value of .08 could be used to determine model fit in all conditions except those with low FL. We summarized the performance of SRMRW in Table 3.

Discussion

The current study examined the performance of level-specific χ2 test statistics and fit indices derived from the partially saturated-model method in correctly identifying good or poor model fit in MCFA with dichotomous indicators. Tables 2 and 3 summarize our simulation results that could inform applied researchers of their usage of level-specific χ2 test statistics and fit indices in evaluating MCFA with dichotomous indicators. Note that the results were tied to the specific conditions of this research and should not be overgeneralized. In this section, we highlight major findings for discussion.

First, as summarized in Table 2, the performance of b-l-s χ2 and fit indices was highly associated with the magnitude of ICC in data. Both b-l-s χ2 and fit indices were more promising indicators to correctly indicate model fit in MCFA with dichotomous indicators when ICC increased. Unfortunately, in conditions with low ICC (ICC1 = .091), we found \({\chi}_B^2\) had statistical power far below .80, and all b-l-s fit indices (RMSEAB, CFIB, TLIB, and SRMRB) were not useful to evaluate between-level models along with traditional cut-off values. This is congruent with previous findings derived from MCFA with continuous indicators (Hsu et al., 2017; Lee & Sohn, 2022). Note that between-level indicators have stronger relations when ICC is higher, allowing for possibly “greater” discrepancies between the model-implied and observed between-level variance-covariance matrices (Kline, 2011). As a result, given identical misspecification and sample size, the value of \({\chi}_B^2\) increased (higher statistical power to detect the misspecification) with higher ICC. Because RMSEAB is a function of the \({\chi}_B^2\), the pattern of RMSEAB in response to ICC was similar to \({\chi}_B^2\). Regarding CFIB and TLIB, both were functions of the distance between the χ2 value of the hypothesized model and that of the baseline null model: the greater the distance between two χ2 values, the larger were CFIB and TLIB. When the between-level model was misspecified, a higher ICC not only led to greater χ2 values of the hypothesized model and the baseline null model, but also resulted in larger distances between these two χ2 values. That is why we observed that both CFIB and TLIB were more promising for detecting misspecified between-level models when ICC increased. Similarly, SRMRB was more effective when ICC was higher. This is because SRMRB reflects the deviation between the between-level model-implied and observed variance-covariance matrices; the deviation would become larger with an increase in ICC. On the other hand, we found that ICC did not affect the effectiveness of w-l-s χ2 and fit indices because the within-level variation was held constant across the ICC conditions in this study (i.e., within-level indicators had constant relations across ICC conditions).
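The dependence of RMSEA, CFI, and TLI on χ2 values described in this paragraph can be made concrete with their textbook definitions. The sketch below is a minimal illustration under stated assumptions: the function names are hypothetical, the input values are invented for the example, and the choice of sample size entering RMSEA at each level (and whether N or N − 1 is used) is a convention that is not spelled out here, so `n` is left as a parameter.

```python
from math import sqrt

def rmsea(chi2, df, n):
    """RMSEA from a model chi-square, its df, and a sample size `n`."""
    return sqrt(max(chi2 - df, 0.0) / (df * n))

def cfi(chi2, df, chi2_null, df_null):
    """CFI: 1 minus the ratio of model to baseline noncentrality."""
    d_model = max(chi2 - df, 0.0)
    d_null = max(chi2_null - df_null, d_model)
    return 1.0 if d_null == 0 else 1.0 - d_model / d_null

def tli(chi2, df, chi2_null, df_null):
    """TLI from chi-square/df ratios of the baseline and hypothesized models.

    Undefined when the baseline ratio equals exactly 1.
    """
    ratio_null = chi2_null / df_null
    ratio_model = chi2 / df
    return (ratio_null - ratio_model) / (ratio_null - 1.0)

# Invented values for illustration only (not taken from the study).
print(round(rmsea(60.0, 30, 100), 3))
print(round(cfi(60.0, 30, 400.0, 40), 3))
print(round(tli(60.0, 30, 400.0, 40), 3))
```

Holding the hypothesized-model χ2 fixed, a larger baseline χ2 raises both CFI and TLI, which matches the statement above that these indices grow with the distance between the two χ2 values.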

Second, as presented in Table 2, another factor influencing the performance of b-l-s χ2 and fit indices was the magnitude of factor loadings. Decreasing factor loadings were associated with decreasing values of \({\chi}_B^2\) and of all b-l-s fit indices. In conditions with low FL (.40), the statistical power of \({\chi}_B^2\) was far below .80, and no b-l-s fit index could correctly identify model fit along with traditional cut-off values. These findings are in line with Nestler’s (2013) and Heene et al.’s (2011) studies. As noted by Heene et al. (2011), the problem of using \({\chi}_B^2\), RMSEAB, and SRMRB in low FL conditions is that the values of these indicators will be too small to detect the misspecified between-level models. On the other hand, the problem of using CFIB and TLIB in low FL conditions is that these indicators will over-reject correctly hypothesized between-level models.

Third, in addition to ICC and FL, the size of NG played an important role in the performance of \({\chi}_B^2\) as well as b-l-s fit indices if traditional cut-off values were applied. As shown in Table 2, a small to medium NG (50–100) might be sufficient for b-l-s χ2 and fit indices only if both ICC and FL were high. In the remaining conditions, an NG of 200 was needed to use b-l-s χ2 and fit indices for between-level model evaluation. These findings raise a pressing issue: when ICC and FL are not both high, how can we assess the between-level model fit in MCFA with dichotomous indicators given an NG far less than 200? Thus far, the possible solution we are aware of is to compute adjusted \({\chi}_B^2\) test statistics based on post hoc corrections (Herzog & Boomsma, 2009; Savalei, 2010). Specifically, when the NG is small (i.e., small sample size at the between level), the estimated \({\chi}_B^2\) might not follow a χ2 distribution (Bentler & Yuan, 1999). Previous research has shown that computing adjusted χ2 is promising in providing sufficient statistical power and reliable model fit for data with nonnormality or small sample size (Savalei, 2010). To the best of our knowledge, the effectiveness of adjusted \({\chi}_B^2\) test statistics based on post hoc corrections in small NG scenarios has not been studied in MCFA; future studies are needed to shed new light on this issue. In addition, little is known about whether b-l-s fit indices computed based on adjusted \({\chi}_B^2\) are useful for detecting misspecified between-level models in MCFA. Future studies are encouraged to explore this topic.

Fourth, as shown in Table 1, THR had close-to-zero η2s for b-l-s χ2 and fit indices, suggesting the impact of asymmetric thresholds (80:20) on the performance of b-l-s χ2 and fit indices might be trivial. Still, we did discover that the impact of THR can slightly weigh into the performance of b-l-s χ2 and fit indices when ICC was not high. For example, as summarized in Table 2, when ICC was medium (ICC2), we can reach a satisfying statistical power of \({\chi}_B^2\) only in conditions with high FL, large NG (200), and symmetric THR, but not in any conditions with asymmetric THR. This finding is in line with Rhemtulla et al. (2012) and Savalei and Rhemtulla (2013), who found that asymmetric thresholds and small samples reduced adjusted χ2 statistical power. In addition, we observed that under ICC2, b-l-s fit indices might need larger NG/GS to compensate for the presence of asymmetric thresholds. For instance, as shown in Table 2, in conditions with ICC2 and high FL, CFIB and TLIB required NG/GS of at least 100/50 or NG/GS of at least 200/10 if thresholds were symmetric, but CFIB and TLIB would require NG/GS at least 200/50 if thresholds were asymmetric. This finding might inform applied researchers of the sample size requirement when asymmetric thresholds occur.

Fifth, the magnitude of factor loadings and asymmetric thresholds should be considered when using w-l-s χ2 and fit indices. As presented in Table 3, w-l-s χ2 and fit indices except for RMSEAW were useful in most conditions to determine model fit at the within level in MCFA with dichotomous indicators. Our study complements earlier work (Hsu, 2009; Hsu et al., 2017; Ryu & West, 2009) that found that w-l-s χ2 and fit indices performed well in MCFA with continuous indicators, and our findings further support the adequacy of adopting w-l-s χ2 and fit indices to evaluate within-level models in MCFA if indicators are dichotomous. However, we found that both FL and THR could influence the performance of w-l-s χ2 and fit indices. As noted above, decreasing factor loadings were associated with decreasing values of χ2 and of fit indices. As a result, one would need to increase NG/GS to compensate for the impact of low factor loadings. In addition, we found that THR had a slight impact on the performance of \({\chi}_W^2\), RMSEAW, CFIW, and TLIW. Table 3 suggests that RMSEAW was heavily affected by FL and THR: RMSEAW can be used to determine model fit only when FL was high and THR was symmetric. The good news is that the other fit indices were useful for determining within-level model fit along with traditional cut-off values.

Last but not least, the results suggest that a low ICC (.091) was linked to low convergence rates in MCFA with dichotomous indicators. This finding is in line with Hox and Maas’s (2001) and Hsu et al.’s (2017) findings, which were based on an MCFA population model with continuous indicators. Because ICCs around .10 seem to be very common in applied research, exploring alternative estimation strategies (e.g., Bayesian estimation; see Depaoli & Clifton, 2015) to resolve the issue of low convergence rates in multilevel confirmatory factor analysis can be crucial. In other words, stable and reliable estimation methods need to be developed and evaluated before addressing further questions about model fit.

Several limitations should be noted. First, our findings should be generalized only to studies that apply MCFA models with dichotomous indicators (Fig. 1). Further studies are needed to test the replicability of our results using different models (e.g., structural models). Second, only a limited number of design factors were considered in the current study. Future studies could expand the examination of additional design factors, such as unequal factor loadings, unequal group conditions, the number of observed indicators per latent factor, and different types of misspecifications (e.g., misspecified in cross-loadings; Moshagen & Auerswald, 2018). Third, considering the computational complexity in MCFA with dichotomous indicators, we generated 500 replications for each of 162 conditions and investigated one misspecification condition in this study which restricted the ability to generalize. Future research could extend our investigation using a more powerful computer.

Concluding remarks

According to our simulation results, we recommend that practitioners be aware that the performance of b-l-s χ2 and fit indices was mainly influenced by data ICC and factor loadings (FL), followed by number of groups (NG), while thresholds of dichotomous indicators (THR) could slightly affect the performance of b-l-s fit indices in some conditions. Specifically, both b-l-s χ2 and fit indices were more promising indicators to correctly indicate model fit in MCFA with dichotomous indicators when ICC or FL increased. In conditions with low ICC (.091) or low FL (.40), \({\chi}_B^2\) had statistical power far below .80, and all b-l-s fit indices were not useful to evaluate between-level models along with traditional cut-off values. In terms of NG, we found that a small to medium NG (50–100) might be sufficient for b-l-s χ2 and fit indices only if both ICC and factor loadings were high, while in the remaining conditions, an NG of 200 was needed to be able to use b-l-s χ2 and fit indices for between-level model evaluation. The performance of b-l-s χ2 and fit indices is summarized in Table 2. On the other hand, we recommend that practitioners use w-l-s χ2 and fit indices (except for RMSEAW) along with traditional cut-off values to evaluate within-level models comprising dichotomous indicators. However, practitioners should be aware that both FL and THR could influence the performance of w-l-s χ2 and fit indices. Both w-l-s χ2 and fit indices were more promising to determine model fit when FL increased. When FL was low, one would need to increase NG and group size (GS) to compensate for the impact of low factor loadings. THR had a slight impact on the performance of \({\chi}_W^2\), RMSEAW, CFIW, and TLIW. Unfortunately, RMSEAW was heavily affected by FL and THR: RMSEAW can be used to determine model fit only when FL was high and THR was symmetric. The performance of w-l-s χ2 and fit indices is summarized in Table 3.