Introduction

When multiple groups are compared using factor models, researchers are often interested in group mean differences. However, prior to any multiple-group analysis, measurement invariance should be established across the groups (DeShon, 2004). Mellenbergh (1989) formalized the concept of measurement invariance using conditional probability:

$$ P\left({Y}_{ij}=y|{\eta}_i,G\right)=P\left({Y}_{ij}=y|{\eta}_i\right), $$
(1)

where ηi is the factor score for the ith examinee, Yij is the ith examinee's response score on the jth item, and G denotes group membership. Equation 1 states that the probability of the ith examinee's response to the jth item, conditioned on the latent-factor score, is independent of group membership G. In other words, if measurement invariance holds, examinees with the same latent-factor scores are expected to have the same probability of endorsing a response on the measure, regardless of their group membership. Multiple-group confirmatory factor analysis (MGCFA) is perhaps the most widely used method for testing measurement invariance among applied researchers, owing to its flexibility and convenience.

Alternatively, multiple-indicator multiple-causes modeling (MIMIC; Jöreskog & Goldberger, 1975) has been employed for detecting measurement invariance and testing for latent mean differences (e.g., Fleishman, Spector, & Altman, 2002; McCarthy, Pedersen, & D'Amico, 2009; Muthén, Kao, & Burstein, 1991; Rubio, Berg-Weger, Tebb, & Rauch, 2003; Woods, Oltmanns, & Turkheimer, 2009). MIMIC has several advantages in model specification for measurement invariance testing. For example, multiple categorical variables (e.g., ethnicity or socioeconomic status) and their interaction terms can be tested simultaneously (e.g., Ainsworth, 2008; Fleishman et al., 2002), and measurement invariance across levels of a continuous covariate can be investigated (Barendse, Oort, & Garst, 2010).

Measurement invariance, in general, is tested with a sequence of increasingly restrictive models. The sequence begins with the equality of confirmatory factor model configurations across groups (configural invariance), then moves to equality of the factor loadings (metric or weak invariance), the intercepts (scalar or strong invariance), and the error variances of the observed variables (strict invariance). Homogeneity of the factor loadings, intercepts, and error variances across groups (strict invariance) is necessary for common factor models to provide reasonable assurance that measurement invariance holds across multiple groups (e.g., DeShon, 2004; Meredith, 1993).

However, measurement invariance studies have suggested that equality constraints on the factor loadings and intercepts across groups are sufficient for multiple-group analysis (e.g., Jöreskog & Sörbom, 1993; Marsh, 1994; McArdle, 1998; Sörbom, 1974), because differences in error variances affect only the reliability of the observed variables (Little, 1997). In addition, when latent variables are compared across groups, measurement error is taken into account in the latent variables (Marsh, Nagengast, & Morin, 2013). Thus, the invariance of error variances is often not considered when group mean differences in the latent factors or observed scores are of concern, as long as strong invariance holds for the data.

On the other hand, some scholars have raised concerns about the possible impacts of heterogeneous error variances in multiple-group analysis. Lubke and colleagues took admission decisions based on observed scores as an example (Lubke, Dolan, Kelderman, & Mellenbergh, 2003). If heterogeneous error variances truly exist between groups, incorrect admission decisions could be made more frequently for the group with the larger error variance. Heterogeneity of error variances could also mislead the interpretation of measurement invariance tests based on the likelihood ratio (LR) test or model fit indices, because inflated chi-squares or poor model fit can occur when the model is misspecified (i.e., when homogeneous error variances are assumed despite heterogeneity). Consequently, the measurement invariance test could point toward noninvariance even though measurement invariance actually holds.

In addition, it is not uncommon to observe correlated error structures in practical situations (e.g., Heene, Hilbert, Freudenthaler, & Bühner, 2012; Lubke et al., 2003). Correlated errors can occur when item contents overlap or when items are logically dependent on one another. If the item contents are multidimensional but a unidimensional model is chosen, a correlated error structure can also arise, because the unexplained residuals will be correlated through the unspecified factor. In the context of multiple-group analysis, error covariances could be present in only one of the groups compared (Lubke & Dolan, 2003). For example, in cross-cultural studies, respondents in one culture might interpret negatively worded items differently, and the errors of the negatively worded items would then be correlated in this cultural group. Researchers have empirically examined the impact of correlated error structures in confirmatory factor analysis (CFA) and have reported that correlated errors can bias the factor loading and reliability estimates (e.g., Heene et al., 2012; Raykov, 2001; Shevlin, Miles, Davies, & Walker, 2000). Because an error covariance can indicate the presence of an additional factor, either substantive or nuisance, that is not specified in the model, ignoring an error covariance in one group could be consequential in multiple-group analysis.

In this study, heterogeneity in either error variances or error covariances is considered a heterogeneous error structure across groups. Of note is that an error covariance present in one group but not in the other would be considered a violation of configural invariance, because the configuration of the CFA model would not be homogeneous across groups. On the other hand, heterogeneity in the error variances would be considered a violation of strict invariance, given equality of the configurations, factor loadings, and intercepts. Strict invariance indicates homogeneous error variances and covariances in addition to equal factor loadings and intercepts.

Although the issue of a noninvariant error variance–covariance structure (or, interchangeably, simply error structure) in multiple-group analysis has been raised, the impact of such heterogeneity on measurement invariance testing and latent-factor mean difference testing has not been systematically investigated to date. Hence, a Monte Carlo study was needed to empirically examine the extent to which misspecifying the error structure affects tests of measurement invariance and latent-factor mean differences with two commonly used multiple-group analysis models, namely MGCFA and MIMIC. Given that heterogeneity in the error structure is potentially more problematic for MIMIC, which lacks the flexibility to specify different error variances or covariances across groups, comparing MGCFA and MIMIC directly is worthwhile.

MGCFA and measurement invariance

In a single-group confirmatory factor model, continuous random variables Y are regressed on continuous latent variables η. Given that i = 1, . . . , I for examinees and j = 1, . . . , J for items, the single-group confirmatory factor model can be represented as follows:

$$ {Y}_{ij}={\nu}_j+{\sum}_{k=1}^K{\lambda}_{jk}{\eta}_{ik}+{\varepsilon}_{ij}, $$
(2)

where νj, λjk, ηik, and εij denote the intercepts, factor loadings, latent factors, and residuals, respectively. In Eq. 2, there are a total of K latent factors, k = 1, . . . , K. Additionally, the residuals are assumed to follow a multivariate normal distribution with mean vector 0 and diagonal covariance matrix Θ. The diagonal matrix Θ implies an uncorrelated error structure for the confirmatory factor model. To incorporate the measurement invariance concept into MGCFA, suppose there are a total of G groups, denoted g = 1, . . . , G. Also, let the expected values of the random variables Yij and ηik in vector form be denoted μg and αg, respectively, for group g. The covariance matrices of the random variables Yij and ηik are denoted Σg and Ψg, respectively, for group g. Then, assuming measurement invariance is satisfied, the mean and covariance of Y for group g can be represented in matrix form as follows:

$$ {\boldsymbol{\mu}}_{\boldsymbol{g}}=\boldsymbol{\nu} +\boldsymbol{\Lambda} {\boldsymbol{\alpha}}_{\boldsymbol{g}}, $$
(3)
$$ {\boldsymbol{\Sigma}}_{\boldsymbol{g}}=\boldsymbol{\Lambda} {\boldsymbol{\Psi}}_{\boldsymbol{g}}{\boldsymbol{\Lambda}}^{\prime }+\boldsymbol{\Theta}, $$
(4)

where Θ is a diagonal matrix of the error variance components, and Λ represents the factor loading matrix with respect to the latent factors. Equations 3 and 4 imply that (a) the factor loadings are equal across groups (Λg = Λ), (b) the intercepts are equal across groups (νg = ν), and (c) the residual covariance matrices are equal across groups (Θg = Θ). When those conditions are satisfied in MGCFA, measurement invariance (or factorial invariance) is considered to hold across groups g = 1, . . . , G (Meredith, 1993). Then, differences in the observed means across groups (μg) are due solely to differences in the factor means across groups (αg), and differences in the observed variance–covariance matrices (Σg) are due solely to differences in the factor variance–covariance matrices across groups (Ψg).
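To make Eqs. 3 and 4 concrete, the following minimal numpy sketch (with illustrative parameter values, not the study's generating values) computes the model-implied moments for two groups under full invariance; only the latent means differ, so the implied item means shift by Λαg while the implied covariance matrices coincide.

```python
import numpy as np

nu = np.zeros(6)                   # invariant intercepts (6 items)
Lam = np.full((6, 1), 0.8)         # invariant loadings, one factor (assumed value)
Theta = np.diag(np.full(6, 0.36))  # invariant, diagonal error covariance matrix

def implied_moments(alpha_g, psi_g):
    """Model-implied mean vector (Eq. 3) and covariance matrix (Eq. 4) for group g."""
    mu_g = nu + (Lam @ alpha_g).ravel()
    Sigma_g = Lam @ psi_g @ Lam.T + Theta
    return mu_g, Sigma_g

# Reference group: alpha = 0; focal group: alpha = .5 (a latent mean shift)
mu_ref, Sigma_ref = implied_moments(np.array([[0.0]]), np.array([[1.0]]))
mu_foc, Sigma_foc = implied_moments(np.array([[0.5]]), np.array([[1.0]]))
print(mu_foc - mu_ref)                    # each item mean shifts by lambda * .5 = .4
print(np.allclose(Sigma_ref, Sigma_foc))  # True: identical implied covariances
```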

MIMIC and measurement invariance

Alternatively, in MIMIC, group variables are considered causal indicators of the factors. These causal indicators are coded as dummy variables (Xi), and the model estimates their effects on the latent factors. For simplicity of the discussion, a single causal indicator distinguishing two groups (i.e., reference and focal groups) is included:

$$ {Y}_{ij}={\nu}_j+{\lambda}_j{\eta}_i+{\varepsilon}_{ij}, $$
(5)
$$ {\eta}_i=\gamma {X}_i+{\zeta}_i. $$
(6)

Equation 5 represents the measurement relationships between an observed variable and a latent factor in MIMIC. In Eq. 6, Xi denotes a dummy variable that indicates group membership, γ denotes an effect or path coefficient of the group variable on the latent factor, and ζi represents the disturbance of the latent factor. Given that the mean of the disturbance term for the latent factor is 0, the expected value of the latent factor can be represented as follows:

$$ E\left({\eta}_i\right)=\gamma E\left({X}_i\right), $$
(7)

where γ indicates the group difference in the latent-factor means with a dummy-coded grouping variable, Xi (Thompson & Green, 2006). In other words, the latent-factor mean for the focal group (Xi = 1) is γ units higher (or lower) than that of the reference group (Xi = 0). Also note that the residuals are assumed to be mutually independent (i.e., Cov[εij, εij′] = 0 for j ≠ j′) and their variances (Var[εij]) homogeneous, regardless of group membership.
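As a numerical illustration of Eqs. 5–7 (with assumed values for λj, νj, and γ rather than the study's generating parameters), the sketch below simulates MIMIC-type data and verifies that the latent means differ by γ and the observed item means by λjγ.

```python
import numpy as np

rng = np.random.default_rng(1)
n, gamma = 100_000, 0.5                     # gamma: latent mean difference (assumed)
lam, nu = np.full(6, 0.8), np.zeros(6)      # assumed loadings and intercepts

X = rng.integers(0, 2, n)                   # dummy code: 0 = reference, 1 = focal
eta = gamma * X + rng.normal(0, 1, n)       # Eq. 6: eta = gamma * X + zeta
eps = rng.normal(0, 0.6, (n, 6))            # homogeneous, independent errors
Y = nu + eta[:, None] * lam + eps           # Eq. 5

print(eta[X == 1].mean() - eta[X == 0].mean())          # ~ gamma = .5 (Eq. 7)
print(Y[X == 1].mean(axis=0) - Y[X == 0].mean(axis=0))  # ~ lam * gamma = .4
```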

Measurement invariance testing using MIMIC can be performed by adding a direct path from the grouping variable to the observed variables (Kim, Yoon, & Lee, 2012):

$$ {Y}_{ij}={\nu}_j+{\lambda}_j{\eta}_i+{\beta}_j{X}_i+{\omega}_j{\eta}_i{X}_i+{\varepsilon}_{ij}, $$
(8)
$$ {\eta}_i=\gamma {X}_i+{\zeta}_i. $$
(9)

Statistical significance of the direct path from the grouping variable to the observed variable (βj) indicates that the intercept of item j is not invariant across groups; this is referred to as uniform noninvariance (Woods & Grimm, 2011).

Similarly, nonuniform measurement noninvariance can be tested using MIMIC, by adding a path from an interaction term between the latent factor and the grouping variable to the observed variables (ηiXi, in Eq. 8). Statistical significance of the path from the interaction term to the observed variable (ωj) implies factor loading noninvariance or nonuniform noninvariance of the associated item (e.g., Barendse et al., 2010; Barendse, Oort, Werner, Ligtvoet, & Schermelleh-Engel, 2012; Woods & Grimm, 2011).
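The data-generating side of Eq. 8 can be sketched as follows (all parameter values are hypothetical): a nonzero βj shifts item j's intercept for the focal group (uniform noninvariance), whereas a nonzero ωj alters item j's effective loading on the factor for the focal group (nonuniform noninvariance).

```python
import numpy as np

rng = np.random.default_rng(2)
n, gamma = 50_000, 0.3
lam, nu = np.full(6, 0.8), np.zeros(6)
beta = np.array([0, 0.4, 0, 0, 0, 0.0])    # uniform noninvariance on Y2 (index 1)
omega = np.array([0, 0, 0.3, 0, 0, 0.0])   # nonuniform noninvariance on Y3 (index 2)

X = rng.integers(0, 2, n)                  # grouping covariate
eta = gamma * X + rng.normal(0, 1, n)      # Eq. 9
Y = (nu + beta * X[:, None]                          # intercept shift when X = 1
     + (lam + omega * X[:, None]) * eta[:, None]    # loading shift when X = 1
     + rng.normal(0, 0.6, (n, 6)))                  # Eq. 8

# For the focal group, Y2's intercept is nu_2 + beta_2 and Y3's slope on eta is
# lambda_3 + omega_3; Wald tests on beta_j and omega_j target exactly these paths.
```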

Noninvariance of the error structure in MIMIC and MGCFA

The issue of noninvariance in the error variance–covariance structure is particularly salient for MIMIC, because MIMIC inherently assumes strict invariance. Relaxing the equality of factor loadings and intercepts between groups is possible, as we explained earlier. However, relaxing the equality of residual variances–covariances between groups is challenging, which is one of the major limitations of MIMIC. Previously, Kim, Yoon, and Lee (2012) investigated the performance of MIMIC when strict invariance was incorrectly specified in the presence of factor loading noninvariance. Their study concluded that MIMIC could not detect noninvariance of the factor loadings properly, and they recommended using MIMIC only when factor loading invariance is established, unless the factor loading equality is relaxed by including the Factor × Grouping variable interaction. A question remained unresolved, however: How does MIMIC behave in the presence of residual variance–covariance noninvariance when it assumes strict invariance? Moreover, the sensitivity of the model fit indices to violations of MIMIC's strict invariance assumption is worth investigating when the error structures between groups are misspecified. MGCFA, on the other hand, has greater flexibility in modeling the error structures between groups: in MGCFA, the error variance–covariance matrices can be freely estimated for each group (Θ1 ≠ Θ2).

Purpose of the study

The purpose of the present study was to investigate the impact of a misspecified error structure on measurement invariance testing and latent-factor mean estimation when MGCFA and MIMIC are used for multiple-group analysis. More specifically, we examined the performance of both metric and scalar invariance tests, following typical measurement invariance procedures, under conditions in which either the error variances or the error covariances were heterogeneous across groups. For each type of misspecification, we further examined the accuracy of the latent-factor mean estimates from MGCFA and MIMIC under the assumption of strict invariance. Finally, we investigated the sensitivity of the model fit indices to the misspecification of the error structure.

Method

Simulation design

A simulation study was conducted to investigate how heterogeneity in the residual variance–covariance structure affects tests of configural, metric, and scalar invariance with MGCFA and MIMIC. The simulation conditions included manipulations of (1) the type of error structure misspecification (heterogeneous error variances vs. heterogeneous error covariances), (2) the size of the heterogeneity (small vs. large), (3) the number of heterogeneous items (one vs. two), (4) the sample size for each group (100, 200, 500, 1,000), and (5) the size of the population latent-factor mean difference (0, .1, .5). A total of 96 (2 × 2 × 2 × 4 × 3) conditions were included, and 1,000 replications were generated for each condition.

Data generation

Data were generated on the basis of a unidimensional single-latent-factor model with six observed variables (Y1–Y6) for two groups. The generating parameters for the reference group are presented in Table 1. The factor loadings and intercepts were simulated as homogeneous between groups. The latent-factor variance was fixed at 1 for both the reference and focal groups, whereas the latent-factor mean was 0 for the reference group and 0, .1, or .5 for the focal group, depending on the simulation condition. For the reference group, the generating values for the residual variances were computed with the parameterization λ² + θ² = 1. As a result, the reliability coefficient (ω) for the reference group was .87. For the focal group, because the residual variances were manipulated across conditions, the λ² + θ² = 1 parameterization was not applied. Accordingly, the focal group's reliability coefficient varied from .84 to .86, depending on the size and number of misspecifications (more severe misspecification resulted in lower reliability). All items were generated as continuous variables and were assumed to follow a multivariate normal distribution. The generating parameters were based on previous MGCFA and MIMIC studies (e.g., Kim et al., 2012).

Table 1 Generating parameter values for the reference group
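The sketch below illustrates this data-generation scheme under stated assumptions: a common loading of .7 is assumed for all six items, which is consistent with the correspondence reported later between covariances of .2/.4 and correlations of about .4/.8, but the actual Table 1 values may differ item by item, and the study itself generated data in Mplus rather than Python.

```python
import numpy as np

def generate_group(n, alpha, var_shift=0.0, var_items=(), cov_pairs=(),
                   cov=0.0, rng=None):
    """Draw n cases (six items, one factor) for one group with latent mean alpha."""
    rng = rng if rng is not None else np.random.default_rng()
    lam = np.full(6, 0.7)                    # assumed loadings
    Theta = np.diag(1.0 - lam**2)            # residual variances from lambda^2 + theta^2 = 1
    for j in var_items:
        Theta[j, j] += var_shift             # heterogeneous error variances (+.2 or +.4)
    for j, k in cov_pairs:
        Theta[j, k] = Theta[k, j] = cov      # heterogeneous error covariances (.2 or .4)
    eta = rng.normal(alpha, 1.0, n)          # factor variance fixed at 1
    eps = rng.multivariate_normal(np.zeros(6), Theta, n)
    return eta[:, None] * lam + eps          # intercepts set to 0 for simplicity

rng = np.random.default_rng(3)
ref = generate_group(500, alpha=0.0, rng=rng)
# Focal group with, e.g., large heterogeneity (+.4) in Y2's error variance and a
# small latent mean difference (.1):
foc = generate_group(500, alpha=0.1, var_shift=0.4, var_items=(1,), rng=rng)
```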

Simulation conditions

Type of error structure misspecification

We manipulated error structure misspecifications in two ways: (a) heterogeneity of the error variances and (b) heterogeneity of the error covariances between the reference and focal groups. For the heterogeneous error variance conditions, the focal group had higher error variances than did the reference group. For the heterogeneous error covariance conditions, the focal group's errors were correlated. We included this type of heterogeneity because, in practice, errors are not always independent (e.g., Heene et al., 2012; Lubke et al., 2003), and error covariances may occur in only one of the groups. To the best of our knowledge, few studies have investigated the impact of heterogeneous variance–covariance structures in MGCFA (e.g., Green & Hershberger, 2000; Lubke & Dolan, 2003), and no studies have examined the performance of MIMIC under such conditions.

Size of heterogeneity

We considered two levels for the size of the heterogeneity: small and large. For the heterogeneous error variance conditions, the focal group had error variances .2 (small) or .4 (large) higher than those in the reference group. Similarly, for the heterogeneous error covariance conditions, the error covariance in the focal group was set at one of two levels, .2 (small) or .4 (large), which correspond to correlations of about .4 and .8, respectively. It should be noted that the small covariance (.2) is more commonly observed in applied research, whereas the large covariance (.4) represents a more extreme condition than usual.

Number of heterogeneous items

We included scenarios in which the number of heterogeneous items was one or two. Note that the total number of items across conditions was six (Y1–Y6). Thus, for the heterogeneous error variance conditions, 17% or 33% of the variables were noninvariant, respectively; for the heterogeneous error covariance conditions, 33% (one pair) or 67% (two pairs) of the items were involved in error covariances. When the number of heterogeneous items was one, Y2 was selected as the heterogeneous item for the error variance conditions, and Y2 and Y3 were selected as the correlated-error items for the error covariance conditions. When the number was two, Y2 and Y5 were selected as the heterogeneous error variance items, and two pairs (Y2 and Y3, Y4 and Y5) were selected for the heterogeneous covariances.

Group size

Four levels of group size were considered: 100, 200, 500, and 1,000 in each group. A balanced group design (i.e., equal sample sizes for the reference and focal groups) was considered across simulation conditions.

Size of the population factor mean difference

The effect size of the population factor mean difference was manipulated to have three levels in this study (0, .1, and .5), representing no, small, and large factor mean differences, respectively. These effect sizes have been commonly used in previous simulation studies (e.g., Barendse et al., 2010; Kim et al., 2012). For factor mean difference testing, Type I error rates were estimated when no mean difference was generated in the population; power was estimated with small and large mean differences.

Measurement invariance tests

A series of measurement invariance tests (configural, metric, and scalar) was conducted with MGCFA and MIMIC under the simulated conditions. Of note is that the measurement invariance tests were conducted only for the conditions in which the latent-factor mean difference was .1. We chose these conditions because the size of the latent-factor mean difference (0, .1, or .5) does not affect measurement invariance testing as long as the factor means are correctly specified (i.e., allowed to differ across groups). For the measurement invariance tests using MGCFA, we used likelihood ratio (LR) tests for nested models. That is, a configural-invariance model, in which the factor loadings, intercepts, and residual variances were relaxed between groups, was compared to a metric-invariance model, in which the factor loadings were constrained to be equal between groups. Similarly, scalar invariance was tested by comparing the metric-invariance model with the scalar-invariance model, in which the intercepts were additionally constrained to be equal. It should be kept in mind that MGCFA was a correctly specified model for the heterogeneous error variance conditions, because the error variances were allowed to differ between groups. For the heterogeneous error covariance conditions, however, it was a misspecified model, because MGCFA assumed independent error structures for both groups. For the measurement invariance tests using MIMIC, a configural-invariance model was constructed by including two paths (βj and ωj in Eq. 8) for all items except the first one, which was fixed for identification: a path from the grouping covariate to each item, and a path from the interaction between the grouping covariate and the latent factor to each item. The metric-invariance model was then constructed by constraining all path coefficients from the interaction to the items (ωj) to zero. The scalar-invariance model was constructed with additional zero constraints on the paths from the grouping covariate to all items (βj). Note that when the configural-, metric-, and scalar-invariance MIMIC models were fitted, the latent-factor mean difference between the two groups (γ) was simultaneously estimated. As with the LR tests in MGCFA, these nested models were compared sequentially (i.e., configural vs. metric, metric vs. scalar) to determine measurement invariance. We used the Satorra–Bentler correction (Satorra & Bentler, 2001) for the LR tests in MIMIC, because robust maximum likelihood (MLR) was used for the model estimation.
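For reference, the Satorra–Bentler (2001) scaled chi-square difference test can be sketched as follows, where T0/T1 are the robust (scaled) chi-squares of the more and less constrained models, c0/c1 their scaling correction factors, and d0/d1 their degrees of freedom; the numeric inputs below are made up for illustration and are not results from the study.

```python
from scipy.stats import chi2

def sb_scaled_diff(T0, d0, c0, T1, d1, c1):
    """Satorra-Bentler scaled chi-square difference test for nested models."""
    cd = (d0 * c0 - d1 * c1) / (d0 - d1)   # difference-test scaling factor
    TRd = (T0 * c0 - T1 * c1) / cd         # scaled difference statistic
    return TRd, chi2.sf(TRd, d0 - d1)      # statistic and p value on d0 - d1 df

# e.g., metric (constrained, 5 extra df) vs. configural MIMIC model:
TRd, p = sb_scaled_diff(T0=58.2, d0=14, c0=1.05, T1=45.1, d1=9, c1=1.02)
print(f"scaled LR difference = {TRd:.2f}, p = {p:.3f}")  # reject if p < .05
```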

The MIMIC model with a Factor × Covariate interaction is often estimated with MLR rather than maximum likelihood (ML) estimation because a numerical integration algorithm is required (i.e., TYPE = RANDOM and ALGORITHM = INTEGRATION in Mplus). Because we generated the response data from a multivariate normal distribution, we did not expect any substantial difference between MLR and ML. In a preliminary study, we compared the performance of ML and MLR and found no notable differences for MIMIC (e.g., the ML and MLR outputs for several replications were identical). Thus, the choice between ML and MLR should not affect the results of this study.

In addition to the LR tests, measurement invariance was evaluated with Wald tests in MIMIC. In the configural-invariance model, the statistical significance of an ωj path coefficient indicates nonuniform noninvariance (i.e., a violation of metric invariance) for the tested item, whereas the statistical significance of a βj path coefficient indicates uniform noninvariance (i.e., a violation of scalar invariance). For parameterization of the models, we fixed the parameters of the first observed variable to be equal across groups and freely estimated those of the other items. Of note is that MIMIC was a misspecified model under heterogeneity of both the error variances and the error covariances, because MIMIC does not allow heterogeneity in the errors to be modeled.

Latent-factor mean tests

In addition to the measurement invariance tests, we further explored the accuracy of the estimated latent-factor mean differences across groups when the error structure was misspecified. To create misspecification in the error structure, we constrained the error structures to be equal between groups when they were heterogeneous in the population. In other words, the latent-factor mean difference was tested under the assumption of strict invariance; that is, the error structures were constrained to be equal between groups (i.e., equal error variances and no error covariances), in addition to the equality of the factor loadings and intercepts. For MIMIC, a grouping covariate was included in the model to test the latent-factor mean difference (γ in Eq. 9). Because strict invariance was assumed, no other paths from the grouping variable (i.e., βjXi and ωjηiXi in Eq. 8) were included in the model. In this model, the heterogeneous error structure was the only source of model misspecification, because the factor loadings and intercepts were generated to be equal in the population. The statistical significance of the γ coefficient indicated a statistically significant latent group mean difference. To investigate the behavior of MGCFA in factor mean difference testing under the same error structure misspecification, a strict-invariance model was constructed (i.e., strict-invariance MGCFA), as in MIMIC. The latent-factor mean difference was then evaluated by testing the statistical significance of the second group's latent-factor mean: the first group's mean was constrained to 0, so the second group's mean represented the mean difference between the groups. Thus, both strict-invariance MGCFA and MIMIC were incorrectly specified models. Strict-invariance MGCFA was examined because it has theoretical similarities to MIMIC (Woods, 2009), and it was worthwhile to compare their performance in terms of parameter estimation and model fit indices. For both MGCFA and MIMIC, maximum likelihood estimation was used for the model estimation.

In addition to the strict-invariance models, we fitted the correctly specified MGCFA in order to establish the baseline results for latent-factor mean estimation and model fit index sensitivity. As this was a correctly specified model, the factor loadings and intercepts were constrained to be equal between groups. However, the error variances were freely estimated. For the heterogeneous error covariances conditions, errors were additionally allowed to be correlated for the designated items for one group, as had been generated in the population. Mplus 7 (Muthén & Muthén, 2012) was used for both generating the data and fitting the models. The Mplus program code for the study can be obtained from the authors upon request.

Simulation analysis

Rejection rates

We investigated how error structure misspecification affected tests of metric and scalar invariance when the factor loadings and intercepts were invariant in the population. As simulation outcomes, we examined the rejection rates of metric and scalar invariance. The rejection rate of metric invariance was computed as the proportion of replications in which metric invariance was rejected at alpha .05; the rejection rate of scalar invariance, as the proportion of replications in which scalar invariance was rejected. For the Wald tests in MIMIC, the rejection rates were computed as the proportions of replications in which any of the path coefficients tested across the five items (Y2–Y6) was flagged as statistically significant at alpha .01. The significance level was adjusted to .01 (.05/5 items) in order to control the experiment-wise Type I error rate.
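A minimal sketch of how such rejection rates can be tallied (an assumed implementation, not the authors' code) is given below; the Wald-based rate flags a replication if any item-level p value falls below the Bonferroni-adjusted alpha.

```python
import numpy as np

def lr_rejection_rate(p_values, alpha=0.05):
    """Proportion of replications whose LR test rejects invariance."""
    return (np.asarray(p_values) < alpha).mean()

def wald_rejection_rate(p_matrix, alpha=0.05):
    """Flag a replication if any item-level Wald p value is significant at
    the Bonferroni-adjusted level (here .05/5 = .01 for five tested items)."""
    p = np.asarray(p_matrix)                 # shape: (replications, items)
    return (p < alpha / p.shape[1]).any(axis=1).mean()
```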

Relative bias, RMSE, and SE

The relative biases of the latent-factor mean difference were computed for MGCFA and MIMIC, to investigate the accuracy of latent-factor mean difference estimation. In addition, the root mean squared error (RMSE) and standard error (SE) of the parameter estimates were examined. The relative bias and RMSE were computed as

$$ \mathrm{Relative}\ \mathrm{Bias}={R}^{-1}{\sum}_{i=1}^R\frac{\widehat{\theta_i}-\theta }{\theta }, $$
(10)
$$ \mathrm{RMSE}=\sqrt{R^{-1}{\sum}_{i=1}^R{\left(\widehat{\theta_i}-\theta \right)}^2}, $$
(11)

where θ and \( \widehat{\theta_i} \) represent the generating parameter and the estimate from the ith replication, respectively, and R represents the total number of replications (i.e., R = 1,000). For the conditions in which the generating parameter was 0, we computed bias in the traditional fashion (i.e., \( {R}^{-1}\sum \limits_{i=1}^R\widehat{\theta_i}-\theta \)). A relative bias above .05 was considered to indicate biased parameter estimation (Hoogland & Boomsma, 1998). The SE was obtained by averaging the standard errors across the 1,000 replications.
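Equations 10 and 11 translate directly into code; the sketch below also substitutes plain bias when the generating parameter is 0, as described above.

```python
import numpy as np

def relative_bias(estimates, theta):
    """Eq. 10; falls back to plain bias when the generating value is 0."""
    est = np.asarray(estimates)              # R replication estimates
    if theta == 0:
        return (est - theta).mean()
    return ((est - theta) / theta).mean()

def rmse(estimates, theta):
    """Eq. 11: root mean squared error over replications."""
    est = np.asarray(estimates)
    return float(np.sqrt(((est - theta) ** 2).mean()))

# A value of |relative bias| > .05 would be flagged as biased
# (Hoogland & Boomsma, 1998).
```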

Power and Type I error

We also evaluated the statistical inferences based on the latent-factor mean estimates. When the true population effect size was 0, the Type I error rates of the latent-factor mean difference tests were computed for MGCFA and MIMIC; when the true population effect size was .1 or .5, statistical power was computed. The significance level of the test was set at .05 across conditions. Power and Type I error were computed as the proportion of replications with a statistically significant group mean difference.

Model fit indices

In addition to measurement invariance and latent-factor mean estimation, we investigated the sensitivity of the model fit indices of the strict-invariance MGCFA and MIMIC models to misspecification of the error structure. Commonly used model fit indices were examined: the chi-square (χ2), root mean square error of approximation (RMSEA), comparative fit index (CFI), and standardized root mean square residual (SRMR) statistics. We applied the Hu and Bentler (1999) criteria for RMSEA, SRMR, and CFI: RMSEA greater than .06, SRMR greater than .08, and CFI less than .95 were considered to represent poor model fit. Likewise, a chi-square p value less than .05 was considered to indicate poor model fit. We also examined the fit of the configural-invariance models using the same criteria, because configural invariance was violated in the heterogeneous error covariance conditions. Note that evaluation of the configural-invariance model fit was conducted only with MGCFA, because MIMIC does not have the flexibility to relax the factor loadings, intercepts, and error variances simultaneously across groups.
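These cutoffs can be encoded in a small helper (a sketch of the flagging logic, with per-index flags, since sensitivity was evaluated separately for each index):

```python
def poor_fit_flags(rmsea, srmr, cfi, chi2_p):
    """Per-index poor-fit flags following the Hu and Bentler (1999) cutoffs
    as applied in this study; sensitivity is the proportion of replications
    flagged, computed separately for each index."""
    return {
        "rmsea": rmsea > 0.06,   # RMSEA > .06 -> poor fit
        "srmr": srmr > 0.08,     # SRMR > .08 -> poor fit
        "cfi": cfi < 0.95,       # CFI < .95 -> poor fit
        "chi2": chi2_p < 0.05,   # chi-square p < .05 -> poor fit
    }

print(poor_fit_flags(rmsea=0.02, srmr=0.03, cfi=0.99, chi2_p=0.47))
# {'rmsea': False, 'srmr': False, 'cfi': False, 'chi2': False}
```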

Results

Simulation check with the correctly specified MGCFA model

To establish baseline results for the latent-factor mean difference and model fit indices, we first fitted the correctly specified MGCFA. The results showed that the latent-factor group mean differences were estimated accurately and that the corresponding power and Type I error rates behaved as expected. Power reached up to 1.00, and Type I error rates were well controlled across conditions, ranging from .05 to .07. With regard to model fit, the results overall indicated good fit: the average values ranged from .01 to .03 (RMSEA), .01 to .04 (SRMR), .98 to 1.00 (CFI), and .45 to .49 (chi-square p value). The results table for the correctly specified model is not reported here but is available upon request.

Measurement invariance tests

Table 2 shows the rejection rates of the measurement invariance tests with MGCFA and MIMIC. When the error variances were heterogeneous, MGCFA maintained rejection rates around the nominal .05 level, as expected, because the error variances were allowed to differ in MGCFA (i.e., it was a correctly specified model). On the other hand, MIMIC was a misspecified model, because it estimated a single set of error variances for both groups. The rejection rates of MIMIC were slightly inflated when metric invariance was tested with a large sample size under large error structure heterogeneity (e.g., .13 for the SB LR tests when the group size was 1,000 and there were two misspecified items of large size). The rejection rates of the scalar-invariance tests were .05 or less across conditions.

Table 2 Rejection rates of metric and scalar invariance under error structure heterogeneity

When the error covariances were heterogeneous, both MGCFA and MIMIC were misspecified models, and both showed relatively high rejection rates when metric invariance was tested. As the sample size and the number and size of the error covariances increased, the rejection rates increased considerably. In contrast, the rejection rates of the scalar invariance tests were generally low, with values less than or around .05, with some exceptions (i.e., when the group size was 1,000 and two items were misspecified with large magnitude).

Latent-factor group mean difference

The results of the latent-factor mean difference tests are presented in Table 3. The table includes the relative bias, RMSE, and Type I error and power rates for MIMIC. Because the relative bias and RMSE were similar across the three factor mean difference conditions, only those for the large effect size conditions are reported. The SE results were very similar to the RMSE results and are not included in the table. The results for strict-invariance MGCFA are not included because no notable differences emerged between strict-invariance MGCFA and MIMIC. Note that both models were misspecified in terms of the error structure.

Table 3 Accuracy of the estimated latent-factor difference and the corresponding power and Type I error for MIMIC

As is shown in Table 3, the latent-factor mean difference in MIMIC was estimated accurately with minimal bias, regardless of the error structure misspecifications. The maximum relative bias was –.03, which is less than 5% of the population value. In addition, the statistical power and Type I error rates were comparable to those of the correctly specified strong-invariance models. Type I errors were well controlled, ranging from .05 to .07 across conditions, and power reached 1.00 when the effect size was large. As expected, larger effect sizes and sample sizes resulted in higher power across conditions.

Sensitivity of model fit indices

We computed sensitivity as the proportion of replications in which the fitted model was flagged as having poor fit. Thus, a sensitivity rate close to 1.00 indicates that the model fit index was sensitive to the misspecification in the error structure. Table 4 shows the sensitivity of the model fit indices for strict-invariance MGCFA and MIMIC across conditions.

Table 4 Sensitivity of the model fit indices of MGCFA and MIMIC

Three patterns emerged from the sensitivity rates (see Table 4). First, in general, the model fit indices for MIMIC were less sensitive than those for MGCFA. Second, for both strict-invariance MGCFA and MIMIC, the model fit indices were more sensitive to misspecification due to heterogeneous error covariances than to heterogeneous error variances. Third, among the model fit indices, CFI and SRMR were less sensitive than RMSEA and the chi-square test. When heterogeneous error variances were present, the RMSEA and chi-square of strict-invariance MGCFA became more sensitive as the size and number of the error misspecifications increased, whereas MIMIC consistently showed good model fit across conditions, with sensitivity rates very similar to those of the correctly specified model. When an error covariance in one group was ignored, the RMSEA and chi-square tests generally detected the model misspecification, showing higher sensitivity to larger misspecification. On the other hand, CFI showed high sensitivity rates only in the conditions in which the size of the error misspecifications was large and their number was two, and SRMR often failed to detect the misspecification, showing good fit across conditions. In particular, both CFI and SRMR in MIMIC were insensitive to heterogeneity in the error variance–covariance structure when the size of the error misspecification was small and only one item (or pair) was affected.

In addition, we investigated the model fit results for the configural-invariance MGCFA model. Because a heterogeneous covariance structure violates configural invariance, due to the unmeasured factor structure in one group, it is worthwhile to examine the fit of the configural-invariance model when such an error structure is present. Similar to the results in Table 4, RMSEA and chi-square were, overall, highly sensitive to the heterogeneous covariance structure, whereas CFI and SRMR showed mixed results. For example, the sensitivity values ranged from .69 to 1.00 for RMSEA and from .49 to 1.00 for chi-square across the simulation conditions. CFI and SRMR, however, were sensitive only in the conditions in which the size of the heterogeneous covariances was large and their number was two (with sensitivity rates ranging from .88 to 1.00 for CFI and from .97 to 1.00 for SRMR). For the other conditions, the sensitivity rates decreased substantially (ranging from .00 to .21 for CFI and from .00 to .11 for SRMR).

Discussion

MGCFA and MIMIC are two widely used models for testing measurement invariance and factor mean differences in applied research. Although a number of studies have investigated the efficacy of MGCFA and MIMIC for testing measurement invariance and factor mean differences, little research has been devoted to investigating the impact of misspecification of the error structure. In this study, therefore, we examined the impact of such violations on measurement invariance testing. We also examined the accuracy of latent-factor mean estimation, inference, and sensitivity of the model fit indices for MGCFA and MIMIC when the invariance assumption was violated for error variance–covariance across multiple groups.

Our key findings were as follows. First, misspecification of the error structure (that is, misspecifying heterogeneous error variances and covariances as invariant) did not have a substantial impact on the estimation and statistical inference of the parameters in the mean structure (i.e., factor means and intercepts). In particular, we observed a minimal impact of such misspecification on factor mean difference estimation and testing. Although the fitted models were misspecified with respect to the heterogeneous error variances or covariances, both MGCFA and MIMIC accurately estimated the latent-factor mean difference between the reference and focal groups. Statistical inferences, such as the power and Type I error rates for the group mean difference, were also unaffected. Note that we examined conditions in which the population latent-factor mean difference was varied (0, .1, and .5), and no substantial differences were found across conditions. As expected, statistical power increased as the latent-factor mean difference increased. These findings imply that misspecification of the error structures between groups is not of great concern when researchers are estimating and testing latent mean differences using MIMIC or MGCFA.

In addition, we observed an impact of the misspecified error structure on measurement invariance testing. This impact was most evident for metric-invariance testing when an error covariance in one group was ignored: the rejection rates reached 100% when the sample size was large and two error covariances of large size (.40) were present. The high rejection of metric invariance likely occurred because, when the error structure was constrained to be equal, the heterogeneity in the error variances and covariances was absorbed into the factor loadings, appearing as metric noninvariance while leaving the mean structure (intercepts and means) largely unaffected. Although the factor loadings were generated to be equal in the population, they were estimated strikingly differently between groups in the configural-invariance model when the error structure was forced to be homogeneous, which could be misinterpreted as an indication of metric noninvariance in applied research settings.

We also examined the sensitivity of the model fit indices to misspecification of the error structure. In general, the model fit indices of MIMIC were less sensitive to error structure misspecification than were those of MGCFA. For both MGCFA and MIMIC, the model fit indices were generally more sensitive to heterogeneous error covariances than to heterogeneous error variances. Among the indices we examined, CFI and SRMR showed relatively less sensitivity to error structure misspecification than did chi-square and RMSEA. In particular, SRMR in MIMIC was completely insensitive, always showing good fit across conditions when the error structure was misspecified; the insensitivity of SRMR in MIMIC was also reported in a previous model-fit study (Kim et al., 2012). These findings suggest that if heterogeneity of the error variances–covariances across groups is of concern, MGCFA should be used to detect a potential misspecification of the error structure. Beyond detecting error structure misspecification through model fit evaluation, MGCFA is advantageous because it allows researchers to model heterogeneous error structures between groups, whereas modeling different error structures between groups is not feasible in MIMIC.

We additionally conducted a simulation with conditions in which both the variances and covariances were heterogeneous across groups. The results were almost identical to those for the heterogeneous covariance conditions, which suggests that misspecified heterogeneous error covariances are both more influential and better detected than misspecified heterogeneous error variances. In terms of measurement invariance, an error covariance in one group constitutes a violation of configural invariance, whereas heterogeneous error variance is a violation of strict invariance. The sensitive model fit results for the configural-invariance model support this interpretation as well. This finding implies that a lack of configural invariance in the error structure is more consequential than a lack of strict invariance, although neither affected latent-factor mean estimation and testing.

To increase the generalizability of the study, we conducted an additional simulation with a condition in which reliability was medium (ω = .75). The medium level of reliability was considered only for the conditions in which the size of the two error variances or covariances was large, because the most evident results had been observed in those conditions. We again varied the sample size in this additional simulation. We found no substantial differences between the two levels of reliability: the error structure misspecification significantly affected the metric invariance test but not the scalar invariance test. The rejection rates for metric invariance increased substantially as the sample size increased for both MGCFA and MIMIC, and this pattern was more evident in the heterogeneous covariance conditions. Also, the latent-factor mean estimation and inference remained accurate and reliable. These results indicate that the findings of the study extend to measures with medium reliability.

Practically and theoretically, understanding the error structures and specifying them correctly between groups is essential in multiple-group analysis for two major reasons, even though misspecification does not directly affect latent or even observed group mean comparisons. First, an error covariance can be interpreted as the presence of an unmeasured factor. In other words, an error covariance in one group suggests that this group may have a different factor structure than the other groups, and understanding the unmeasured factor, whether substantive or nuisance, is an essential part of group comparisons. Second, metric invariance is less likely to be supported when the error structure is considerably misspecified between groups. This is problematic because group heterogeneity in the error variances–covariances is manifested in the factor loadings. For example, in the configural-invariance model, the ignored error covariance in one group was manifested as notable differences in the factor loadings. In practical settings, without knowledge of the true population parameters, differences in the factor loadings between groups observed in the configural-invariance model can be misinterpreted as an indication of metric noninvariance rather than of error structure misspecification (i.e., a violation of configural invariance through an error covariance in one group). Problematically, in subsequent metric-invariance testing, metric invariance will then likely be rejected. Moreover, metric invariance is a precondition of scalar invariance; scalar invariance cannot be established without it. Thus, a group mean comparison may be invalidated, because scalar invariance is considered a prerequisite for a valid group mean comparison. That is, even though the misspecification of the error structure does not affect the group mean comparison itself, applied researchers will be unlikely to proceed with the comparison given the apparent violation of metric invariance.

One positive finding is that some model fit indices are sensitive to error structure misspecification in the configural-invariance model when the size and number of the heterogeneity are large, which could prompt applied researchers to scrutinize the source of model misfit. We therefore recommend thoroughly investigating configural invariance across groups using MGCFA, relying on the RMSEA and chi-square fit indices to uncover possible differences in error structures, on the basis of the present simulation results. Modification indices could guide researchers toward the source of model misfit when configural invariance is rejected, although future research is needed to investigate the performance of modification indices in detecting error covariance misspecification. Also, if researchers have strong theoretical evidence of nonzero error covariances for specific items, we recommend specifying those error covariances in MGCFA and testing the equality of the error structures between groups. Theoretical consideration will be particularly useful when the size and number of the misspecifications are small, because the fit indices of the configural-invariance model may then be uninformative about the error structure misspecification. Finally, if the research purpose is to test measurement invariance and an error covariance is found in only one group, applied researchers can conclude that configural invariance has been violated, and subsequent invariance tests need not be executed. If testing and estimating the latent group mean difference is of focal interest, we expect heterogeneous error variances and covariances to have a minimal impact on the results. It should be kept in mind, however, that the absence of a statistical impact does not mean that the group mean difference will be theoretically interpretable. Thus, heterogeneity in the error structure, and particularly an error covariance, should be theoretically explained and justified.

There are several limitations to the present study. First, we considered only a one-factor model across the simulation conditions. This study was conducted with a relatively simple model; more complex models, including bifactor or two-tier models, could be considered in future research. Because parameter estimation is more challenging in more complex models, the results might differ in such situations. Second, we did not manipulate different sample sizes for the reference and focal groups (an unbalanced design). Given that unbalanced designs are common in applied educational and psychological settings (e.g., Lubke & Dolan, 2003; Woods, 2009), we recommend that future studies investigate the potential impact of unbalanced group sizes paired with heterogeneous error structures in multiple-group analysis. Third, although we found that misspecification of the error structure minimally affects latent-factor mean estimates and inference, the simulation design was limited, and the results should not be generalized to all situations. More studies should explore the impact of the error structure on MGCFA and MIMIC under different simulation conditions. For example, it would be interesting to investigate the impact of error structure misspecification with categorical observed variables, using weighted least squares means- and variance-adjusted estimation, given that categorical variables are prevalent in applied settings.

In conclusion, we investigated the impact of misspecified heterogeneous error variances–covariances in multiple-group analysis, including measurement invariance testing and latent-factor mean difference testing. We found that misspecification of the error variance–covariance structure affects testing and estimation in the covariance structure (i.e., the factor loadings), whereas little impact was observed on the mean structure (i.e., intercepts and latent-factor means). Both the MGCFA and MIMIC approaches robustly estimated the latent-factor mean difference and yielded correct statistical inferences under error variance–covariance misspecification between groups, and scalar invariance was mostly supported when it held in the population. However, metric invariance could be rejected under such misspecification. We therefore suggest that MGCFA be preferred when heterogeneous error structures are of concern, because MGCFA allows heterogeneous error structures to be modeled across groups. It should also be kept in mind that the model fit indices of MIMIC are generally insensitive to misspecified error structures.