Introduction

One of the main missions of the psychological and social sciences is to study individual as well as group differences with regard to latent constructs (e.g., extraversion). Such constructs are commonly measured by means of psychological scales in which subjects rate their level of agreement on various Likert-type items by selecting one of the possible response options. The number of response options typically ranges from 3 to 5, with a clear ordering (e.g., a score of 3 is higher than a score of 2, which in turn is higher than a score of 1). Such items with few naturally ordered categories are called ordinal items.

Equivalence in the measurement of a psychological construct across groups is generally referred to as measurement invariance (MI), and it is a crucial requirement for validly comparing psychological constructs across groups (Borsboom, 2006; Meredith & Teresi, 2006). In fact, ignoring MI when statistically investigating differences between groups can lead to under- or overestimation of group differences in item means (Jones & Gallo, 2002), sum-score means (Jeong & Lee, 2019), and regression parameters in structural equation models (Guenole & Brown, 2014).

In the context of psychological measurement, latent variable modeling is one of the most popular frameworks, and, within this framework, various approaches have been developed to model ordinal data as well as to test for MI. Among them, two of the most widely used are multiple group categorical confirmatory factor analysis (MG-CCFA) and multiple group item response theory (MG-IRT) (Kim & Yoon, 2011; Millsap, 2012). Interestingly, the difference between these two approaches is rather artificial, and the parameters of MG-CCFA and MG-IRT models are known to be directly related (Takane & De Leeuw, 1987). Moreover, Chang et al. (2017) proposed a set of minimal identification constraints that makes MG-CCFA and MG-IRT models fully equivalent.

The equivalence between these models, however, does not necessarily match the way MI is conceptualized and tested within each of the two approaches. For example, one main difference between MG-CCFA and MG-IRT refers to which hypotheses are tested. On the one hand, in MG-CCFA, measurement equivalence is mainly investigated at the scale level, or, in other words, the tested hypothesis is that the complete set of items functions equivalently across groups. On the other hand, in MG-IRT, more attention is dedicated towards the study of each individual item, and, for this reason, within this approach, MI is tested for each item in the scale separately. Another crucial difference relates to the way these hypotheses are tested. In fact, to test whether MI holds, either for a scale or for a specific item, different criteria and/or testing strategies are used within each approach.

Research to date has not yet determined the impact of these differences on the power to detect MI. For instance, some studies compared the performance of MG-CCFA and MG-IRT using solely an item-level testing perspective (Kim & Yoon, 2011; Chang et al., 2017), whereas Meade and Lautenschlager (2004) compared MG-IRT with multiple group confirmatory factor analysis for continuous data (i.e., MG-CFA). Clear guidelines on which approach to choose, and in which setting, would be particularly helpful for applied researchers. In fact, such guidelines might facilitate decisions regarding the level at which (non)invariance is tested (e.g., scale or item level) as well as which tools are most powerful to test it. However, in the current literature, such guidelines have not yet been provided. Therefore, by means of two simulation studies, this paper makes three major contributions: (i) assessing to what extent performing a scale- or an item-level test affects the power to detect MI, (ii) determining which MG-CCFA- or MG-IRT-based testing strategies/measures are more powerful for testing MI, and (iii) based on the results of the simulation studies, providing guidelines on which approach to choose and under which conditions.

To this end, in “MG-CCFA, MG-IRT models and their MI test” we discuss both MG-CCFA- and MG-IRT-based models and illustrate how they are equivalent under a set of minimal identification constraints. Additionally, in the same section, for each of the two approaches, we discuss the differences in the set of hypotheses and the testing strategies in the context of MI. Afterwards, in “Simulation studies” we assess the performance of MG-CCFA- and MG-IRT-based testing strategies in testing MI by means of two simulation studies. Finally, in “Discussion”, we conclude by giving remarks and recommendations along with a summary of the main results obtained in the simulation studies.

MG-CCFA, MG-IRT models and their MI test

The models

Imagine data composed of J items for a group of N subjects. Also, assume that a grouping variable exists such that subjects can be divided into G groups. Let Xj be the response on item j and further assume that Xj is a polytomously scored response which can take on C possible values, with c ∈ {0, 1, 2, ..., C-1}. Let us also assume that a unidimensional construct η underlies the observed responses (Chang et al., 2017).

Multiple group categorical confirmatory factor analysis

In MG-CCFA, it is assumed that the C possible observed values are obtained from a discretization of a continuous unobserved response variable \(X_{j}^{*}\) via some threshold parameters. The threshold \(\tau _{j,c}^{(g)}\) indicates the dividing point between adjacent categories (e.g., the division between a score of 3 and a score of 4). Additionally, the first and the last threshold are defined as \(\tau _{j,0}^{(g)}\) = -\(\infty \) and \(\tau _{j,C}^{(g)}\) = +\(\infty \), respectively. Rewriting formally what we just described, we have:

$$ X_{j} = c, \quad \text{if } \tau_{j,c}^{(g)} < X_{j}^{*} < \tau_{j,c+1}^{(g)}, \quad c = 0, 1, 2, \ldots, C-1. $$
(1)

If it is also assumed that the construct under study is unidimensional, according to a factor analytical model we have:

$$ X_{j}^{*} = \lambda_{j}^{(g)}\eta + \epsilon_{j}, j = 1,2,...,J. $$
(2)

Equation (2) shows that the unobserved continuous response variable \(X_{j}^{*}\) is determined by the latent variable score η via the factor loading \(\lambda _{j}^{(g)}\) and a residual component \(\epsilon_{j}\). The latter represents an item-specific error term. It is important to note that the thresholds \(\tau _{j,c}^{(g)}\) and loadings \(\lambda _{j}^{(g)}\) are group-specific. Additionally, the latent variable η and the item-specific residual component \(\epsilon_{j}\) are assumed to be mutually independent and normally distributed, with:

$$ \eta^{(g)} \sim N(\kappa^{(g)}, \varphi^{(g)}), \quad \text{and} \quad \epsilon_{j}^{(g)} \sim N(0, \sigma_{j}^{2(g)}), $$
(3)

where \(\kappa^{(g)}\) is the factor mean, \(\varphi^{(g)}\) the factor variance, and \(\sigma _{j}^{2(g)}\) the unique variance of item j in group g.
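To make the data-generating mechanism of Eqs. (1) through (3) concrete, the following R sketch simulates ordinal responses for a single group. The loadings and thresholds are illustrative values (not the population values used later in this paper), the unique variances are set to 1 - λ² so that each latent response has unit variance, and, for simplicity, the same thresholds are used for all items.

```r
set.seed(1)

n          <- 1000                     # subjects in one group
lambda     <- c(.7, .6, .8, .5, .6)    # illustrative loadings (not the paper's values)
thresholds <- c(-0.8, 0.3)             # tau_{j,1}, tau_{j,2}: three ordered categories (0, 1, 2)

eta <- rnorm(n, mean = 0, sd = 1)      # latent construct, Eq. (3) with kappa = 0, phi = 1

# Latent response variables X*_j = lambda_j * eta + eps_j, Eq. (2);
# unique variances chosen as 1 - lambda^2 so that V(X*_j) = 1
X_star <- sapply(lambda, function(l) l * eta + rnorm(n, 0, sqrt(1 - l^2)))

# Discretize each X*_j into ordered categories via the thresholds, Eq. (1)
X <- apply(X_star, 2, function(x) findInterval(x, thresholds))

head(X)  # integer item scores 0, 1, 2
```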

Multiple group normal ogive graded response model

MG-IRT models the probability of selecting a specific item category, given a score on the latent construct and a specific group membership. For ordinal items, these conditional probabilities are modeled indirectly through building blocks constructed by means of specific functions; different MG-IRT models use different functions. Because of its similarities with MG-CCFA (Chang et al., 2017), in the following we only consider the multiple group normal ogive graded response model (MG-noGRM; Samejima, 1969). The MG-noGRM uses cumulative probabilities as its building blocks, and the underlying idea is to treat the multiple categories in a dichotomous fashion (Samejima, 1969). First, for each score, the probability of obtaining that score or higher (e.g., selecting 2 or above) is calculated, given the latent construct η. Based on this set of probabilities, the probability of selecting a specific category (e.g., 2) is calculated, given a certain score on η. In the MG-noGRM, like in MG-CCFA, it is assumed that the observed values Xj arise from an underlying continuous latent response variable \(X_{j}^{*}\).

Rewriting formally what we just described, the probability of scoring a certain category c is then:

$$ \begin{aligned} & P(X_{j} = c|\eta, g)\\ &= {\Phi}(\alpha_{j}^{(g)}(\eta- \delta_{j, c}^{(g)})) - {\Phi}(\alpha_{j}^{(g)}(\eta-\delta_{j,c+1}^{(g)}))\\ & = {\Phi}(\alpha_{j}^{(g)}\eta- \alpha_{j}^{(g)}\delta_{j, c}^{(g)}) - {\Phi}(\alpha_{j}^{(g)}\eta-\alpha_{j}^{(g)}\delta_{j, c+1}^{(g)}) \\ &= \displaystyle{\int}_{\alpha_{j}^{(g)}\eta - \alpha_{j}^{(g)}\delta_{j,c+1}^{(g)}}^{\alpha_{j}^{(g)}\eta - \alpha_{j}^{(g)}\delta_{j, c}^{(g)}} \phi(u_{j})du_{j} \end{aligned} $$
(4)

where, for group g, \(\alpha _{j}^{(g)}\) is the discrimination parameter for item j, and \(\delta _{j, c}^{(g)}\) is the threshold parameter. The latter represents the point at which the probability of answering at or above category c is .5 for group g. Since ordered categories are modeled, the probability of obtaining at least the lowest score is 1, and the first threshold \(\delta _{j,0}^{(g)}\) is therefore not estimated but set to -\(\infty \). That is, C-1 threshold parameters per group need to be estimated. Note that, like in MG-CCFA, the MG-noGRM model parameters \(\alpha _{j}^{(g)}\) and \(\delta _{j, c}^{(g)}\) are group-specific. Also, ϕ(.) is the probability density function and Φ(.) is the cumulative distribution function of the standard normal distribution.

Similarities with MG-CCFA

The similarities between MG-CCFA and the MG-noGRM can be revealed by taking a closer look at how the parameters in the two models are related (Takane & De Leeuw, 1987; Kamata & Bauer, 2008; Chang et al., 2017):

$$ \alpha_{j}^{(g)} = \frac{\lambda_{j}^{(g)}}{\sigma_{j}}, \quad u_{j} = \frac{\epsilon_{j}}{\sigma_{j}}, \quad \delta_{j, c}^{(g)} = \frac{\tau_{j,c}^{(g)}}{\lambda_{j}^{(g)}}, $$
(5)

and at how the probability of scoring category c given η can be written in MG-CCFA terms:

$$ \begin{aligned} & P(X_{j} = c|\eta, g) = \displaystyle{\int}_{\lambda_{j}^{(g)}\eta - \tau_{j,c+1}^{(g)}}^{\lambda_{j}^{(g)}\eta - \tau_{j,c}^{(g)}} \phi(\epsilon_{j})d\epsilon_{j}\\ & = \displaystyle{\int}_{\lambda_{j}^{(g)}\eta/ \sigma_{j} - \tau_{j,c+1}^{(g)}/ \sigma_{j}}^{ \lambda_{j}^{(g)}\eta/ \sigma_{j} - \tau_{j,c}^{(g)}/ \sigma_{j}} \phi(u_{j})du_{j}. \end{aligned} $$
(6)

The difference between (4) and (6) is that, in MG-CCFA, the loadings \(\lambda _{j}^{(g)}\) and the thresholds \(\tau _{j,c}^{(g)}\) can be inferred only in a relative sense. In fact, they can only be recovered through their ratio with the residual standard deviation σj (Takane & De Leeuw, 1987; Kamata & Bauer, 2008; Chang et al., 2017). This is due to the absence of a scale for the latent response variable \(X_{j}^{*}\). For ease of reading, in the following, only the term loading will be used to refer to both the discrimination parameters and the factor loadings.
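A small numerical check, with illustrative (not paper-specific) parameter values, can make the correspondence in Eqs. (4) through (6) tangible: converting CCFA loadings and thresholds into noGRM discriminations and thresholds via Eq. (5) yields identical category probabilities.

```r
# Illustrative values for one item in one group
lambda <- 0.7                 # factor loading
sigma  <- sqrt(1 - lambda^2)  # residual SD (total variance of X* set to 1)
tau    <- c(-0.8, 0.3)        # thresholds on the X* scale

# Eq. (5): IRT parameters implied by the CCFA parameters
alpha <- lambda / sigma       # discrimination
delta <- tau / lambda         # IRT thresholds

# Probability of the middle category (c = 1) at eta = 0.5
eta <- 0.5

p_irt  <- pnorm(alpha * (eta - delta[1])) - pnorm(alpha * (eta - delta[2]))   # Eq. (4)
p_ccfa <- pnorm((lambda * eta - tau[1]) / sigma) -
          pnorm((lambda * eta - tau[2]) / sigma)                              # Eq. (6)

all.equal(p_irt, p_ccfa)      # TRUE: both parameterizations give the same probability
```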

Identification constraints and models equivalence

Identification of measurement models such as the ones considered here can be achieved by means of identification constraints, which are usually imposed either by fixing some parameters to an arbitrary value or by setting equality constraints across them. This way, the number of parameters to be estimated is reduced, and a unique solution can be found in the estimation process (Millsap & Yun-Tein, 2004; San, 2013; Chang et al., 2017).

In testing MI with multiple groups, both for MG-CCFA and the MG-noGRM, it is necessary to ensure that (i) a scale is set for the latent response variable \(X_{j}^{*}\), (ii) a scale is set for the latent construct η, and (iii) the scale of the latent construct is aligned across groups such that the parameters can be directly compared (Kamata & Bauer, 2008; Chang et al., 2017). Interestingly, these constraints are commonly imposed in different ways in MG-CCFA and in the MG-noGRM.

The observed response for each item is assumed, in both models, to arise from an unobserved continuous response variable \(X_{j}^{*}\). These underlying continuous response variables do not have a scale, which therefore has to be set by constraining their means and variances. In both models, the means of the latent response variables are indirectly constrained to be 0: the intercepts are fixed to 0, so that \(E(X^{*}_{j})\) = λjκ, which equals 0 whenever the factor mean κ is 0.

The variances, however, are generally constrained in different ways. It is common to either set the total variances \(V(X_{j}^{*})\) to 1 (also called Delta parameterization; Muthén, 1998) or the unique variances \({\sigma _{j}^{2}}\) to 1 (also called Theta parameterization; Muthén, 1998). The former is much more common in MG-CCFA, while the latter is closer to what is usually done with the MG-noGRM (Kamata & Bauer, 2008).

The other unobserved element for which a scale has to be set is the latent construct η. Again, this is commonly addressed in a different way in the two approaches. On the one hand, in MG-CCFA a fixed value is commonly chosen for a threshold and a loading. On the other hand, in the MG-noGRM the scale of the latent variable is commonly defined by setting its mean and variance to 0 and 1, respectively. In both cases, these constraints are applied only for one of the two groups, which is usually called the reference group.

Finally, it is necessary to align the scale of both groups to make them comparable. This is commonly achieved by imposing equality constraints on some of the parameters in the model, which is again addressed differently in MG-CCFA and in the MG-noGRM. On the one hand, in MG-CCFA for each latent construct, the factor loading and the threshold of a single item are constrained to be equal across groups. Generally, the loading and the threshold of the first item of the scale are selected. On the other hand, in MG-IRT, multiple items, assumed to function equivalently in both groups, are set equal by constraining their parameters. These items form what is then called the anchor. Note that, in the MG-noGRM, and more generally in MG-IRT models, greater attention is devoted to choosing the items that are constrained to be equal across groups while in MG-CCFA this is not necessarily the case. Nevertheless, in MG-CCFA, French and Finch (2008) have noted that the referent indicator matters, and various methods have been developed to select one or more referent indicators (Lopez Rivas et al., 2009; Woods, 2009; Meade & Wright, 2012; Shi et al., 2017). For a recent overview and comparison of these methods, we refer the reader to Thompson et al., (2021).

A set of minimal constraints that makes MG-CCFA and the MG-noGRM fully comparable has recently been proposed by Chang et al. (2017) and is also presented here. Without loss of generality, imagine two groups, g = r, f, where r represents the reference group and f the focal group. Following Chang et al. (2017):

$$ \sigma_{j}^{2(r)} = 1, \quad \text{for } j = 1, \ldots, J, $$
(7)
$$ E(\eta^{(r)}) = 0, \quad \lambda_{1}^{(r)} = 1, $$
(8)
$$ \begin{aligned} \lambda_{1}^{(r)} &= \lambda_{1}^{(f)}, \quad \sigma_{1}^{2(r)} = \sigma_{1}^{2(f)}, \quad \tau_{1,c}^{(r)} = \tau_{1,c}^{(f)}, \\ & \quad \text{for some } c \in \{0, 1, 2, \ldots, C-1\}, \end{aligned} $$
(9)
$$ \sigma_{j}^{2(r)} = \sigma_{j}^{2(f)}, \quad \text{for } j = 2, \ldots, J. $$
(10)

These constraints serve to set a scale for the latent response variable \(X_{j}^{*}\) and for the latent construct η, and to make the scale comparable across groups. That is, (7) and (8) set the scales of the latent response variable \(X_{j}^{*}\) and the latent construct η for the reference group, while (9) makes the scale comparable across groups using the anchor. Finally, (10) guarantees a common scale across all the other items. Furthermore, the above-mentioned constraints can be seen as MG-IRT-type constraints in which the unique variances \({\sigma _{j}^{2}}\) are constrained to be 1 for both the focal and the reference group, the mean of the latent construct η is set to 0, and at least one item is picked as the anchor item, whose parameters are set to be equal across groups (Chang et al., 2017).

Under these constraints, the two models are exactly equivalent. Thus, differences in testing MI between MG-CCFA and the MG-noGRM depend only on the level at which MI is tested (i.e., scale or item) as well as on the measures and testing strategies used to test it.

MI hypotheses

Generally, a measure is said to be invariant if the score that a person obtains on a scale does not depend on their membership in a specific group but only on the underlying psychological construct. Formally, assume that a vector of scores on some items X = {X1, X2,..., XJ} is observed, and that a vector of scores on some latent variables η = {η1, η2,...,ηr} underlies these scores. Then, measurement invariance holds if:

$$ P(\mathbf{X} \mid \boldsymbol{\eta}, g) = P(\mathbf{X} \mid \boldsymbol{\eta}). $$
(11)

Equation (11) shows that the probability of observing a set of scores X given the underlying latent construct (η) is the same across all groups. Moreover, the equation is quite general in the sense that no particular model is yet specified for P(X|η).

As discussed above, an equivalent model for P(X|η) can be specified for MG-CCFA and the MG-noGRM. One of the main differences in the way these two approaches test MI is then whether a test is conducted for the whole vector of scores at once or for each element of the vector separately. Although, in principle, both types of test can be conducted within each framework, the former is more common in MG-CCFA, while the latter is generally used within MG-IRT.

Scale level

In MG-CCFA, MI is tested for all items at once. Different model parameters can be responsible for measurement non-invariance, and they are tested in a step-wise fashion. In each step, a new model is estimated, with additional constraints imposed on certain parameters (e.g., loadings) to test their invariance. Then, the fit of the model to the data is evaluated to test whether these new constraints worsen it significantly. A significant worsening of fit indicates that at least some of the constrained parameters are non-invariant.

Configural

The starting point in MG-CCFA is testing configural invariance. In this first step, the aim is to test whether the same number of factors holds across groups and whether each factor is measured by the same items. This is generally done by first specifying and then estimating the same model for all groups. Afterwards, fit measures are examined to determine whether the hypothesis that the same model underlies all groups is rejected or not.

Metric

If the hypothesis of configural invariance is not rejected, the next step is to test the equivalence of factor loadings. This step is also called the weak or metric invariance step. Commonly, the factor loadings of all items are constrained to be equal across groups. The hypothesis being tested here is that:

$$ H_{metric}: {\Lambda}^{(g)} = {\Lambda}. $$
(12)

If (12) is supported, the equivalence of factor loadings indicates that each measured variable contributes to each latent construct to a similar extent across groups (Putnick and Bornstein, 2016).

Scalar

If metric invariance holds, scalar invariance, or invariance of the intercepts, can be tested. In MG-CCFA, though, the observed data are assumed to come from an underlying continuous response variable \(X_{j}^{*}\). This variable does not have a scale and, generally, its intercept is fixed to 0. That is why the thresholds, rather than the intercepts, are tested. To test the hypothesis of equal thresholds, these parameters are constrained to be equal across groups, while keeping the previous constraints in place. Formally, the hypothesis being tested is:

$$ H_{scalar}: T_{j}^{(g)} = T_{j}, \quad \text{for } j = 1, 2, \ldots, J. $$
(13)

If the hypothesis in (13) is not rejected, it can be concluded that the threshold parameters of all items are the same across groups. Finally, it is worth noting that, to obtain full factorial invariance, equivalence of the residual variances should also be tested (Meredith & Teresi, 2006). However, many researchers do not consider this step, since it is not relevant when comparing the means of the latent constructs across groups (Vandenberg & Lance, 2000).
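As a rough illustration of this step-wise procedure, the following lavaan sketch fits the configural, metric, and scalar models for two groups, assuming a hypothetical data frame dat with ordinal items x1 to x5 and a grouping variable group. Identification is left at the lavaan defaults; in practice, the constraints imposed in the scalar step for ordinal data may need to be adjusted (e.g., via semTools::measEq.syntax()).

```r
library(lavaan)

model <- 'eta =~ x1 + x2 + x3 + x4 + x5'

# Configural: same structure in both groups, parameters free across groups
fit_configural <- cfa(model, data = dat, group = "group",
                      ordered = paste0("x", 1:5), estimator = "WLSMV")

# Metric: loadings constrained equal across groups
fit_metric <- cfa(model, data = dat, group = "group",
                  ordered = paste0("x", 1:5), estimator = "WLSMV",
                  group.equal = "loadings")

# Scalar: loadings and thresholds constrained equal across groups
fit_scalar <- cfa(model, data = dat, group = "group",
                  ordered = paste0("x", 1:5), estimator = "WLSMV",
                  group.equal = c("loadings", "thresholds"))
```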

Item level

In MG-IRT, the functioning of each item is tested separately. An item shows differential item functioning (DIF) if the probability of selecting a certain category on that item differs across two groups, given the same score on the latent construct. It is important to highlight that, when DIF is tested following a typical MG-IRT-based approach, configural invariance is generally assumed. Also, compared to MG-CCFA, where item parameters are first allowed to differ and then constrained to be equal across groups, testing DIF follows a different rationale: the starting assumption is that all items function equivalently across groups. Formally:

$$ \begin{aligned} & H_{0}: \alpha_{j}^{(g)} = \alpha_{j} = \frac{\lambda_{j}^{(g)}}{\sigma_{j}} = \frac{\lambda_{j}}{\sigma_{j}}, \delta_{j, c}^{(g)} = \delta_{j, c} = \frac{\tau_{j, c}^{(g)}}{\lambda_{j}^{(g)}} = \frac{\tau_{j, c}}{\lambda_{j}}\\ & \text{for \textit{j} = 1,2,..,\textit{J}, \textit{c} = 0,1,2,...,\textit{C}-1}. \end{aligned} $$
(14)

The constraints on one item are then freed up to test whether its parameters are invariant, while keeping the other items constrained to be equal across groups. Afterwards, the procedure is iteratively repeated for all the other items in the scale. DIF can take two different forms: uniform and nonuniform.

Uniform DIF

Given two groups, an ordinal item shows uniform DIF when its threshold parameters differ between groups. In formal terms:

$$ \begin{aligned} & H_{\text{no uniform DIF}}: \delta_{J/k, c}^{(g)} = \delta_{J/k, c} = \frac{\tau_{J/k, c}^{(g)}}{\lambda_{J/k}^{(g)}} = \frac{\tau_{J/k, c}}{\lambda_{J/k}}\\ & \text{for \textit{j} = 1,2,..,\textit{J}, \textit{c} = 0,1,2,...,\textit{C}-1 and for some \textit{k},}\\& \text{where \textit{k} = 1,2,...,\textit{J}}. \end{aligned} $$
(15)

where the subscript J/k stands for all items except item k. Equation (15) shows the hypothesis of no uniform DIF indicating that the thresholds of all items except item k (τJ/k,c) are the same across groups. Furthermore, it is interesting to note the connection between uniform DIF and scalar invariance, since both can be seen as tests for shifts in the threshold parameters.

Nonuniform DIF

An ordinal item shows nonuniform DIF when its loading parameter differs across two groups. The tested hypothesis can be formally written as:

$$ \begin{aligned} & H_{\text{no nonuniform DIF}}: \alpha_{J/k}^{(g)} = \alpha_{J/k} = \frac{\lambda_{J/k}^{(g)}}{\sigma_{J/k}} = \frac{\lambda_{J/k}}{\sigma_{J/k}}\\ & \text{for \textit{j} = 1,2,..,\textit{J}, \textit{c} = 0,1,2,...,\textit{C}-1 and for some \textit{k},}\\& \text{where \textit{k} = 1,2,...,\textit{J}}. \end{aligned} $$
(16)

Equation (16) shows the hypothesis of no nonuniform DIF, indicating that, for all items except item k, the loadings are the same for all groups. Note that, without any further specification of the identification constraints used to identify the baseline model, this test differs from testing metric invariance in MG-CCFA not only because items are evaluated individually but also due to the presence of both loadings λ and unique variances \(\sigma^{2}\). However, under the minimal identification constraints proposed by Chang et al. (2017), the unique variances are constrained to be 1 and equal across groups, making this test equivalent to testing metric invariance in MG-CCFA but for each individual item.

MI testing strategies

MG-CCFA-based

Besides commonly testing different hypotheses, MG-CCFA and MG-IRT differ in terms of which testing strategies/measures are used to test these hypotheses. Within MG-CCFA, the common strategy is to estimate two nested models and then compare how well they fit the data. A measure of how well a model fits the data is commonly obtained by means of a goodness-of-fit index, which quantifies the similarity between the model-implied covariance structure and the covariance structure of the data (Cheung & Rensvold, 2002). To date, many fit indices exist, and they can be divided into three main categories: measures of absolute fit, misfit, and comparative fit (for a more detailed review of the available measures, we refer the reader to Schreiber et al., 2006).

Absolute fit indices

Absolute fit indices focus on the exact fit of the model to the data, and one of the most commonly used is the Chi-squared (χ2) test. Imagine an MG-CCFA model A that fits the data sufficiently well, with \(\chi ^{2}_{ModA}\) and dfModA indicating the model's χ2 statistic and degrees of freedom. To test one of the MI hypotheses (e.g., metric invariance), a new model is specified by constraining the parameters of interest (e.g., loadings) of all items to be equal across groups. Let us call this model B, with \(\chi ^{2}_{ModB}\) and dfModB. A χ2 difference test is then conducted by comparing the two models:

$$ T = \chi^{2}_{ModB}-\chi^{2}_{ModA}, \quad T \sim \chi^{2}(df_{D}), \quad df_{D} = df_{ModB}-df_{ModA}. $$
(17)

A significant T (e.g., using a significance level of .05) indicates that model B fits significantly worse, and thus that model A should be preferred. This implies that invariance of the constrained parameters (e.g., loadings) does not hold. The χ2 test has two considerable limitations. On the one hand, it is underpowered for small samples, because the test statistic is χ2-distributed only as N goes to infinity (i.e., only with large samples). On the other hand, it is highly strict with large samples, indicating, for example, that two models are significantly different even when the differences in the parameters are small.
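Continuing the hypothetical lavaan sketch above, the χ2 difference tests between the nested models can be obtained with lavTestLRT():

```r
# Chi-squared difference tests between nested models, Eq. (17)
lavTestLRT(fit_configural, fit_metric)   # test of loadings invariance (metric step)
lavTestLRT(fit_metric, fit_scalar)       # test of thresholds invariance (scalar step)
```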

Misfit indices

On top of the well-known limitations of the χ2 test, a general counterargument towards the use of absolute fit indices is that we might not be interested in the exact fit as much as in the extent of misfit of the model (Millsap, 2012). In this case, misfit indices, such as the root mean square error of approximation (RMSEA), can be used. This index quantifies the misfit per degree of freedom in the model (Browne & Cudeck, 1993). Specifically, in the case of multiple groups, it can be expressed as:

$$ RMSEA = \sqrt{G} \sqrt{ \max \left[ \frac{\chi^{2}_{ModA}}{df_{ModA}(N-1)}-\frac{1}{N-1}, 0 \right] }. $$
(18)

Based on which MI hypothesis is tested, different criteria and procedures are used to determine whether the RMSEA is acceptable. In the configural step, the absolute value of the RMSEA is used. Specifically, values between 0 and .05 indicate a “good” fit, and values between .05 and .08 are considered a “fair” fit (Browne & Cudeck, 1993; Brown, 2014). In the subsequent steps, the change in the RMSEA (ΔRMSEA) between the constrained and the unconstrained model is used instead of the absolute value of the measure. Specifically, a ΔRMSEA of .01 has been suggested as a cut-off value in the case of metric invariance and, similarly, a value of .01 should be used for scalar invariance (Cheung & Rensvold, 2002; Chen, 2007). When the ΔRMSEA is higher than the specific cut-off, invariance is rejected.

Comparative fit indices

The third category of fit indices is that of comparative fit, where the improvement of the hypothesized model over the null model is used as an index to test MI. Differently from exact fit indices, where the hypothesized model is compared against a saturated model (a model with df = 0), in comparative fit indices a comparison is conducted between the hypothesized model and the null model, with \(\chi ^{2}_{ModNull}\) and dfModNull. The latter is a model in which all the measured variables are uncorrelated (i.e., a model without a common factor). Numerous comparative fit measures exist and, among them, a well-known one is the comparative fit index (CFI) (Bentler, 1990). The CFI measures the overall improvement in the χ2 of the tested model compared to the null model, and can be formally written as:

$$ CFI = 1 - \frac{\chi^{2}_{ModA}- df_{ModA}}{\chi^{2}_{ModNull} - df_{ModNull}} $$
(19)

where a value of .95 or higher is used as a cut-off in the configural invariance step to indicate a “good” fit (Bentler, 1990). In the subsequent steps, the common guidelines for cut-off values focus on the change in CFI (ΔCFI). Specifically, a decrease in CFI larger than .01 (i.e., ΔCFI < -.01) is considered to be problematic both in the case of testing for loadings and thresholds invariance (Cheung & Rensvold, 2002; Chen, 2007). It is worth noting that the default baseline model used in most CFA software (e.g., lavaan; Rosseel, 2012) may not be appropriate for testing MI, and different alternatives exist (Widaman & Thompson, 2003; Lai & Yoon, 2015). Moreover, it is not yet clear whether the commonly accepted cut-off values for the CFI, or alternative fit measures, can be directly applied to models that are not estimated using maximum likelihood; caution is thus recommended in empirical practice when making decisions based on these goodness-of-fit indices (Xia & Yang, 2019).
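For the same hypothetical lavaan fits, ΔRMSEA and ΔCFI can be computed from fitMeasures(); note that, with DWLS estimation, one might instead compare the scaled versions of these indices (e.g., "rmsea.scaled", "cfi.scaled").

```r
# Change in approximate fit indices between nested models
delta_fit <- function(fit_constrained, fit_free,
                      measures = c("rmsea", "cfi")) {
  fitMeasures(fit_constrained, measures) - fitMeasures(fit_free, measures)
}

delta_fit(fit_metric, fit_configural)  # Delta-RMSEA and Delta-CFI for the metric step
delta_fit(fit_scalar, fit_metric)      # Delta-RMSEA and Delta-CFI for the scalar step
```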

MG-IRT-based

In MG-IRT-based approaches both parametric and nonparametric methods exist to test for uniform and nonuniform DIF. In this paper, the focus is on parametric methods, where a statistical model is assumed. Specifically, methods that compare the models’ likelihood functions will be discussed (for a more detailed discussion on both parametric and nonparametric methods for DIF detection, we refer the reader to Millsap, 2012).

Likelihood-ratio test

One well-known technique for the study of DIF is the likelihood-ratio test (LRT) (Thissen et al., 1986; Thissen, 1988; Thissen et al., 1993). In this test, the log-likelihood of a model with the parameters of all items constrained to be equal across groups is compared against the log-likelihood of the same model with freed parameters for one item only. The former is sometimes called the compact model (LC), while the latter is sometimes called the augmented model (LA; Kim & Cohen, 1998; Finch, 2005). Once these two models are estimated and their log-likelihoods (lnLC and lnLA) are obtained, the test statistic (G2) can be calculated using the following formula:

$$ G^{2} = -2\ln L_{C} - (-2\ln L_{A}) = -2\ln L_{C} + 2\ln L_{A}. $$
(20)

Similarly to the Chi-squared test in MG-CCFA, the test statistic G2 is χ2 distributed with df equal to the difference in the number of parameters estimated in the two models (Thissen, 1988). The same procedure is then iteratively repeated for all items. It is important to highlight that the above equation represents an omnibus test of DIF, which, in case of a significant result, can be further inspected by constraining only specific parameters. For example, it is possible to test uniform DIF by allowing only the thresholds to vary across groups.
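A minimal sketch of this LRT procedure with the mirt package is given below, again assuming a hypothetical data frame dat of three-category ordinal items x1 to x5 and a grouping factor group; the DIF() call performs the omnibus test by freeing, in turn, all parameters of each tested item and comparing the augmented model against the compact model.

```r
library(mirt)

items <- paste0("x", 1:5)

# Compact model: all item parameters constrained equal across groups,
# with group means and variances freed for the focal group
fit_compact <- multipleGroup(dat[, items], model = 1, group = dat$group,
                             itemtype = "graded",
                             invariance = c(items, "free_means", "free_var"))

# Omnibus LRT for DIF, Eq. (20): drop the equality constraints on the slope (a1)
# and the intercepts (d1, d2; three-category items) of each tested item in turn
dif_results <- DIF(fit_compact, which.par = c("a1", "d1", "d2"),
                   scheme = "drop", items2test = 1:5)
dif_results
```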

Logistic regression

Logistic regression (LoR; Swaminathan & Rogers, 1990) is another parametric approach that has recently gained interest among DIF experts (Yasemin et al., 2015). The intuition behind the LoR approach is similar to that of step-wise regression, in which one tests whether the model improves by sequentially entering new predictors. Starting with a null model where only the intercept is estimated, the variables are commonly introduced by first adding the latent construct, then the grouping variable, and finally an interaction between the latent construct and the grouping variable. Formally, this sequence of models is written as:

$$ \text{Model 0: } \operatorname{logit} P(y_{j} \geq c) = \nu_{c}; $$
(21)
$$ \text{Model 1: } \operatorname{logit} P(y_{j} \geq c) = \nu_{c} + \beta_{1}\eta; $$
(22)
$$ \text{Model 2: } \operatorname{logit} P(y_{j} \geq c) = \nu_{c} + \beta_{1}\eta + \beta_{2}G; $$
(23)
$$ \text{Model 3: } \operatorname{logit} P(y_{j} \geq c) = \nu_{c} + \beta_{1}\eta + \beta_{2}G + \beta_{3}\eta G. $$
(24)

In the equations above, \(P(y_{j} \geq c)\) is the probability of the score on item j falling in category c or higher, and νc is a category-specific intercept. It is worth pointing out that, compared to the LRT, the latent variable scores are in this case estimated only once and then treated as observed, which can be problematic. In fact, since the latent variable scores are estimated and not observed, there is uncertainty in these estimates, which could, in turn, affect the performance of this method. Moreover, some alternative formulations use sum scores instead of estimates of the latent variable scores (Rogers & Swaminathan, 1993). Once the logistic regression models are estimated and a deviance (G2) is obtained for each of them, an omnibus DIF test can be conducted by:

$$ G_{omnibus}^{2} = G_{Model3}^{2} - G_{Model1}^{2}, $$
(25)

which is asymptotically χ2 distributed with df = 2 (Swaminathan & Rogers, 1990). Zumbo (1999) suggested investigating the source of bias by separately testing for uniform and nonuniform DIF, respectively:

$$ G_{uniDIF}^{2} = G_{Model2}^{2} - G_{Model1}^{2} $$
(26)

and:

$$ G_{nonuniDIF}^{2} = G_{Model3}^{2} - G_{Model2}^{2} $$
(27)

where both (26) and (27) are χ2 distributed with df = 1.

The omnibus test procedure (25) has been shown to produce an inflated number of incorrectly flagged DIF items (Type I errors; Li & Stout, 1996). To solve this issue, a combination of a significant 2-df LRT (25) and a measure of the magnitude of DIF based on a pseudo-R2 statistic has been suggested as an alternative criterion (Zumbo, 1999). The underlying idea is to treat the β coefficients as weighted least squares estimates and look at the difference in pseudo-R2R2) between the model with and without the added predictor (e.g., Cox & Snell, 1989). Specifically, to flag an item as showing DIF, both a significant χ2 test (with df = 2) and a ΔR2 of at least .13 are suggested to be used (Zumbo, 1999).
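The following sketch illustrates the LoR procedure for a single item using MASS::polr(), assuming a hypothetical data frame d with an ordered item score y, an estimated latent trait score theta, and a grouping factor grp; in practice, packages such as lordif automate these steps for all items.

```r
library(MASS)

# Models 0-3 from Eqs. (21)-(24); y must be an ordered factor.
# Note: polr() parameterizes logit P(y <= c) rather than P(y >= c),
# which does not affect the nested-model comparisons below.
m0 <- polr(y ~ 1,           data = d, method = "logistic")  # intercepts only
m1 <- polr(y ~ theta,       data = d, method = "logistic")  # + latent trait score
m2 <- polr(y ~ theta + grp, data = d, method = "logistic")  # + group (uniform DIF)
m3 <- polr(y ~ theta * grp, data = d, method = "logistic")  # + interaction (nonuniform DIF)

# Omnibus 2-df likelihood-ratio test, Eq. (25)
anova(m1, m3)

# McFadden pseudo-R2 change between Model 1 and Model 3
mcfadden <- function(m) 1 - as.numeric(logLik(m) / logLik(m0))
delta_R2 <- mcfadden(m3) - mcfadden(m1)

# Flag the item as showing DIF when the 2-df test is significant
# AND delta_R2 exceeds the chosen effect-size cut-off
```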

Simulation studies

To evaluate the impact of MG-CCFA- and MG-IRT-based hypotheses and testing strategies on the power to detect violations of MI, two simulation studies were performed. In the first study, an invariance scenario was simulated where parameters were invariant between groups. In the second study, a non-invariance scenario was simulated where model parameters varied between groups.

Simulation Study 1: invariance

In the first study, three main factors were manipulated:

  1. The number of items at two levels: 5 and 25, to simulate a short and a long scale;

  2. The number of categories for each item at two levels: 3 and 5;

  3. The number of subjects within each group at two levels: 250 and 1000.

These factors were chosen to represent situations that can be encountered in psychological measurement. For example, the two levels at which the scale length varies are representative of (i) short scales that are used as an initial screening or to save assessment time in case of multiple administrations (e.g., in clinical settings), and (ii) long scales typically used to obtain a more detailed and precise evaluation of the measured psychological construct. For the number of categories, the two levels mimic items constructed to capture a more or less nuanced degree of agreement. Finally, the two simulated sample sizes resemble studies with “relatively” small samples (e.g., clinical settings) and with large samples (e.g., cross-cultural research).

A full-factorial design was used with 2 (number of items) x 2 (number of categories) x 2 (number of subjects within each group) = 8 conditions. For each condition, 500 replications were generated.

Method

Data generation

Data were generated from a factor model with one factor and two groups. The population values of the model parameters were chosen prior to conducting the simulation study and are reported in Table 1. Note that, for both groups, the factor mean and variance were set to 0 and 1, respectively. The choice of the values began with specifying the standardized loadings. Specifically, they were selected to resemble values commonly found in real applications, with items having medium to high correlations with the common factor but differing among them (Stark et al., 2006; Wirth & Edwards, 2007; Kim & Yoon, 2011).

Table 1 Population values used in the simulation study

The second step was to select the thresholds. To choose them, continuous data with 10,000 observations were first generated under a factor model using the loadings in Table 1. Afterwards, using the distribution of the item scores for item 1, which was subsequently used as the anchor item, the tertiles (for items with three categories) and the quintiles (for items with five categories) were calculated. Then, the remaining thresholds were generated by shifting the tertiles/quintiles of the first item by half a standard deviation. In detail, for both the three- and the five-category case, we shifted the threshold values of the second and fifth item by + .50 and those of the third and fourth item by - .50 (as can be seen from Table 1). In the conditions with 25 items, the same parameters in Table 1 were repeated five times. For all estimated models, we used the minimal identification constraints described in Eqs. (7) through (10) to identify the baseline model, and item 1 was used as the anchor item.
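The following sketch mirrors this threshold-construction procedure for the three-category case, using illustrative loadings (the actual population values are those reported in Table 1).

```r
set.seed(123)

lambda <- c(.7, .6, .8, .5, .6)          # illustrative standardized loadings

# Step 1: generate 10,000 continuous observations under the one-factor model
n   <- 10000
eta <- rnorm(n)
Y   <- sapply(lambda, function(l) l * eta + rnorm(n, 0, sqrt(1 - l^2)))

# Step 2: thresholds of the anchor (item 1) are the tertiles of its continuous scores
tau_anchor <- quantile(Y[, 1], probs = c(1/3, 2/3))

# Step 3: remaining items' thresholds are the anchor thresholds shifted by +/- .50 SD
tau <- rbind(tau_anchor,            # item 1 (anchor)
             tau_anchor + .50,      # item 2
             tau_anchor - .50,      # item 3
             tau_anchor - .50,      # item 4
             tau_anchor + .50)      # item 5

# Discretize each item with its own thresholds to obtain three-category scores
X <- sapply(1:5, function(j) findInterval(Y[, j], tau[j, ]))
```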

Data analysis

Scale level

The specification of the MG-CCFA models to test MI followed the common steps of a general MI testing procedure as described in Sect. 2.2.1. Specifically, in the configural step, a unidimensional factor model was fitted to both groups, allowing loadings and thresholds to differ between groups (configural invariant model). In the metric step, factor loadings were constrained to be equal across groups while allowing the thresholds to be freely estimated (metric invariant model). In the scalar step, both factor loadings and thresholds were constrained to be equal across groups (scalar invariant model). Afterwards, a χ2 test (α = .05) was conducted between: (i) the models estimated in the configural and the metric step, to test for loadings invariance, and (ii) the models estimated in the metric and the scalar step, to test for thresholds invariance. Additionally, the change in RMSEA (ΔRMSEA) and in CFI (ΔCFI) was calculated between these models. Loadings non-invariance was concluded if at least one of the following criteria was met: a significant χ2 test, a ΔRMSEA > .01, or a ΔCFI < -.01. Additionally, since the common guidelines reported in the literature recommend basing decisions about (non)invariance of parameters on various indices, a combined criterion was created. According to this combined criterion, loadings non-invariance at the scale level was concluded if both a significant χ2 test and at least one of a ΔRMSEA > .01 or a ΔCFI < -.01 were found (Putnick & Bornstein, 2016). Thresholds non-invariance at the scale level was concluded if at least one of the following criteria was met: a significant χ2 test, a ΔRMSEA > .01, or a ΔCFI < -.01. Also in this case, a combined criterion was created. Specifically, a scale was considered non-invariant with respect to thresholds if both a significant χ2 test and at least one of a ΔRMSEA > .01 or a ΔCFI < -.01 were found. All MG-CCFA models were estimated using diagonally weighted least squares (DWLS), with the full weight matrix used to compute the mean-and-variance-adjusted test statistics (the default in lavaan; Rosseel, 2012). This is a two-step procedure: in the first step, the thresholds and the polychoric correlation matrix are estimated; in the second step, the remaining parameters are estimated using the polychoric correlation matrix from the previous step.

In MG-IRT-based procedures, MI is tested for each item individually. Therefore, to conduct a test at the scale level, we decided to flag the scale as non-invariant if at least one item was flagged as non-invariant, correcting for multiple testing. Two different testing strategies were considered: the logistic regression (LoR) procedure and the likelihood-ratio test (LRT). Within LoR, two different criteria were used to flag an item as non-invariant. The first criterion is based on the likelihood-ratio test. Specifically, an item was flagged as non-invariant, either with respect to loadings or thresholds, in the case of a significant χ2 test (α = .05) between a model including the latent construct score, the grouping variable, and an interaction between the two (see formula 24) and a model with only the latent construct score (see formula 22) (Swaminathan & Rogers, 1990). The second criterion, which will from this point on be called R2, combines the just-mentioned χ2 test with a measure of the magnitude of DIF. The latter is obtained by computing the difference in a pseudo-R2 measure between the two above-mentioned models (ΔR2). Using this approach, an item was flagged as non-invariant when both a significant χ2 test and a ΔR2 > .02 were found (Choi, Gibbons, & Crane, 2011). Specifically, in this simulation study, the McFadden pseudo-R2 measure was used (Menard, 2000). In the case of the LRT, two different models per item were estimated. In one model, the constraints on the thresholds were released for a specific item (uniform DIF model), while in the other the constraint on the loading was released (nonuniform DIF model). Additionally, a model with all items constrained to be equal was estimated (fully constrained model). An item was flagged as non-invariant with respect to thresholds in case of a statistically significant 1-df LRT (α = .05) between the fully constrained model and the uniform DIF model. Similarly, an item was flagged as non-invariant with respect to loadings in case of a statistically significant 1-df LRT (α = .05) between the fully constrained model and the nonuniform DIF model. This procedure was repeated iteratively for all the other items. Since multiple tests were conducted for the scale, a Bonferroni correction was used.

Item level

In order to test MI at the item level using an MG-CCFA-based testing strategy, a backward/step-down procedure was used (Kim & Yoon, 2011; Brown, 2014). The rationale is the same as the one just described for the LRT in MG-IRT. Specifically, the constraints (either on the thresholds or on the loading) were released for only one item, while keeping all the other items constrained to be equal. Hence, for each item, two different models were estimated. Then, the χ2 test (α = .05) was conducted and the ΔRMSEA and ΔCFI were calculated. This procedure was then repeated iteratively for all the other items. Note that, due to the multiple tests conducted, a Bonferroni correction was used. For MG-IRT-based procedures, the same procedures and criteria used at the scale level were used to test MI at the item level (but without applying a Bonferroni correction).

Outcome measures

The convergence rate (CR) and the false-positive rate (FPR) were calculated for both MG-CCFA- and MG-IRT-based procedures, both at the scale level and at the item level. The CR indicates the proportion of models that converged, while the FPR represents the proportion of scales/items incorrectly flagged as non-invariant. If models did not converge, new data were generated and the models were rerun in order to always calculate the FPR based on 500 replications.

Data simulation, software, and packages

The data were simulated and analyzed using R (R Core Team, 2013). Specifically, for estimating the MG-CCFA models and obtaining fit measures, the R package lavaan was used (Rosseel, 2012), while for LoR and the LRT, lordif (Choi et al., 2011) and mirt (Chalmers, 2012) were used, respectively.

Results

Convergence rate

The convergence rate was almost 100% for all the considered approaches across all conditions. Model non-convergence was observed only in a few conditions with small sample sizes and short scales, and never exceeded 1%. The tables showing the complete results can be found in the Appendix (Tables 10 through 13).

Scale-level performance

The scale-level results when loadings equivalence was tested are reported in Table 2. For MG-CCFA-based approaches, ΔRMSEA showed an FPR > .10 in the conditions with short scales, whereas, for ΔCFI, this inflation was observed only in the conditions with both small sample sizes and short scales. Within MG-IRT-based approaches, the results differed considerably depending on the testing strategy. For the LoR approach using the LRT criterion, the results obtained in this simulation study align with those in the existing literature, with an evident inflation of the FPR (overall, FPR > .40) (Rogers & Swaminathan, 1993; Li & Stout, 1996). For the R2 criterion, where a combination of the LRT and a pseudo-R2 measure was used, the FPR was at or below the chosen α level, with an inflated FPR only in the condition with N = 250, C = 3, and J = 5 (FPR = .182). One possible explanation is that, due to the small amount of information available for each person in this condition, there is more uncertainty in the estimated scores of the latent construct. Since these estimates are then used as observed variables in the LoR procedure, they are likely to produce a larger number of items incorrectly flagged as non-invariant. Finally, the LRT showed an acceptable FPR in all conditions when testing for loadings equivalence at the scale level.

Table 2 Loadings’ FPR scale level - invariance scenario

The results of the simulation study when equivalence of thresholds was tested at the scale level are reported in Table 3. For MG-CCFA-based methods, the FPR was above .10 for ΔRMSEA in the conditions with short scales and for ΔCFI in the conditions with short scales and small sample sizes. The combined criterion and the χ2 test provided acceptable FPRs across conditions. For MG-IRT-based testing strategies, the results are similar to those observed when testing loadings equivalence. Specifically, for the LoR approach, the R2 criterion performed well in all conditions except when N = 1000, C = 3, and J = 5 (FPR = .189). Moreover, the LRT criterion for LoR showed an evident inflation across all conditions. Finally, the LRT performed well in all conditions.

Table 3 Thresholds’ FPR scale level - invariance scenario

Item-level performance

The results when loadings equivalence was tested at the item level are reported in Table 4. For MG-CCFA, all fit measures performed well, as indicated by FPRs close to the nominal α level. For MG-IRT using the LoR procedure, the LRT criterion produced a high number of false positives with short scales. Moreover, the FPRs for both the R2 criterion and the LRT were within the chosen α level in almost all conditions, and never exceeded .06.

Table 4 Loadings’ FPR item level - invariance scenario

Finally, the results when testing thresholds equivalence at the item level are reported in Table 5. For MG-CCFA, all criteria performed reasonably well, with some small inflation for ΔCFI in the conditions with small sample sizes and short scales. For MG-IRT-based testing strategies, only the LRT criterion for the LoR approach showed an FPR higher than the chosen α level with J = 5.

Table 5 Thresholds’ FPR item level - invariance scenario

Simulation Study 2: Non-invariance

In the second simulation study, three more factors were included to evaluate the performance of the studied approaches, with their respective testing strategies, in detecting violations of MI when parameters were non-invariant across groups. In addition to varying the scale length, the number of categories, and the sample size, we also varied:

  1. The percentage of items with non-invariant loadings at three levels: 20%, 40% aligned, and 40% misaligned;

  2. The percentage of items with non-invariant thresholds at three levels: 20%, 40% aligned, and 40% misaligned;

  3. The amount of bias imposed for each non-invariant parameter at two levels: small and large.

The first three factors (i.e., the number of items, the number of categories for each item, and the number of subjects within each group) were the ones used in the previous simulation study. Additionally, to simulate differences in loadings/thresholds across groups, the values of the parameters were changed for either 20% or 40% of the items. Moreover, in the conditions with 40% of the items having non-invariant loadings, the values were either increased for all affected items (e.g., all loadings are higher in one group), or increased for half of the affected items and decreased for the other half (e.g., in the condition with five items, where the values of two loadings are changed, one was increased and the other decreased). The former was labeled an aligned change, the latter a misaligned change.

The same procedure was followed for the shifts in thresholds, both in terms of the percentage of items with non-invariant thresholds and in terms of aligned versus misaligned shifts. Note that, since each item has more than one threshold, all the thresholds of that item were shifted.

The percentages of items showing non-invariant loadings/thresholds were chosen to represent situations that can be observed in psychological measurement: for instance, situations with a well-functioning scale where only one item (in the case of short scales) or a few items (in the case of long scales) seem to function differently across groups or, alternatively, situations with a badly functioning scale where almost half of the items function differently across groups. Aligned differences were simulated to represent scales where items favor only one group, while misaligned differences mimic a situation where different items favor different groups.

The manipulated violations of MI, both for loadings and thresholds, were either small or large, in order to represent both moderately and badly functioning items. On the one hand, a difference of .1 or .2 was used to simulate small and large changes in the standardized factor loadings, respectively. The chosen values substantially increase the variance of the item accounted for by the factor. For example, for a standardized factor loading of .7, the variance of the item explained by the factor is .7² = .49. If the loading is increased by .1, the explained variance becomes .8² = .64 and, in the case of a large change (.2), it becomes .9² = .81. On the other hand, for the shifts in thresholds, the parameters of one group were shifted by either a quarter (.25) or half a standard deviation (.50) to simulate small and large violations of thresholds invariance.

In total, 2 (number of items) x 2 (number of categories) x 2 (number of subjects within each group) x 3 (percentage of non-invariant loadings) x 3 (percentage of non-invariant thresholds) x 2 (amount of bias imposed) = 144 conditions were simulated for the conditions with non-invariance in the loadings and the thresholds. For each condition, 500 replications were generated.

Method

Data analysis

As in the first simulation study, the data were generated from a factor model with one factor and two groups. The population parameters were the same as in the first simulation study and were varied, depending on the condition, as explained above. Moreover, the procedures used to specify and estimate the models, both at the scale and at the item level, were the same as before. Differently from the first study, only a subset of the criteria was used to flag a scale/item as non-invariant. Specifically, only the criteria that showed an acceptable FPR across all conditions in the first simulation study are reported. This was done because procedures with unacceptable FPRs should not be considered for testing MI, and considering them here would hence not make sense. Thus, for MG-CCFA, only the results of the combined criterion and the χ2 test are reported, while for MG-IRT-based procedures only the results of the LRT approach and, for the LoR approach, of the R2 criterion are reported.

Outcome measures

The convergence rate (CR), true-positive rate (TPR), and false-positive rate (FPR) were calculated for both the MG-CCFA- and MG-IRT-based procedures, both at the scale and at the item level. Here, the TPR represents the proportion of non-invariant scales/items that are correctly identified as such, while the FPR represents the proportion of invariant scales/items that are incorrectly identified as non-invariant. If models did not converge, new data were generated and the models were rerun in order to always calculate the TPR and the FPR based on 500 replications.

Results

Convergence rate

Scale level

The results of the CR when testing loadings equivalence at the scale level in the non-invariance scenario are displayed in Table 14 in the Appendix. In the conditions with a large sample size, the CR was above 99% for all approaches. Compared to these conditions, the CR dropped in the conditions with a small sample size and 40% of the items showing large misaligned changes in loadings. Specifically, the CR for MG-CCFA was .978 when J = 5 and C = 3, while for MG-IRT using the LoR approach the CR was around .90 with N = 250 and J = 25, regardless of whether items had 3 or 5 categories.

The results of the CR when testing thresholds equivalence at the scale level in the non-invariance scenario are displayed in Table 15 in the Appendix. For MG-CCFA, the CR was generally lower in the conditions with large shifts in the thresholds compared to the conditions with small shifts. For example, with N = 250, C = 3, J = 5, and large misaligned shifts in the threshold parameters, the CR was .828. This lower CR could be due to a specific issue with the estimation procedure. In fact, using DWLS, the estimation heavily relies on the first step, where the thresholds and the polychoric correlation matrix are estimated. Large differences in thresholds between the two groups might affect this first step and, in turn, the remainder of the procedure. In contrast, for MG-IRT-based approaches, the CR was always above 99%.

Item level

The results of the CR when testing loadings equivalence at the item level in the non-invariance scenario are displayed in Table 16 in the Appendix. These results closely resemble the ones observed when loadings equivalence was tested at the scale level. Specifically, the CR was below .98 for MG-CCFA only in the condition with N = 250, C = 3, J = 5, and large misaligned changes in loadings in 40% of the items. Moreover, for MG-IRT using the LoR approach, the CR was around .89 when N = 250, J = 25, and with large misaligned changes in the loadings, regardless of the number of categories per item.

The results of the CR when testing thresholds equivalence at the item level in the non-invariance scenario are displayed in Table 17 in the Appendix. For MG-CCFA, similar to what was observed at the scale level, the CR dropped in the conditions with a small sample size, large shifts in thresholds, and short scales compared to the other conditions. For example, the lowest CR was observed in the condition with N = 250, C = 3, J = 5, and large misaligned shifts in thresholds (CR = .798). In contrast, for MG-IRT-based approaches, the CR was always above 99%.

Scale-level performance

The results of the TPR when testing loadings equivalence at the scale level in the non-invariance scenario are displayed in Table 6. Although none of the approaches was particularly sensitive to small changes in loadings, the χ2 test generally outperformed the other testing strategies. For MG-CCFA, in addition to the χ2 test, a combined criterion was used to flag scales or items as non-invariant, and Table 20 in the Appendix displays the TPRs for each of the measures that form this combined criterion. For ΔCFI, the results seemed to depend strongly on the length of the scale. In fact, for long scales, when small loading differences were simulated and the sample size was large, the TPRs drastically dropped, reaching values generally close to 0. Also, since in the first simulation study the LoR approach with N = 250, J = 5, and C = 3 had an unacceptable FPR, the corresponding results in this simulation study are reported in red to indicate that they should not be considered.

Table 6 Loadings’ TPR scale level - non-invariance scenario

The results of the TPR when testing thresholds equivalence at the scale level in the non-invariance scenario are displayed in Table 7, and the results for each of the fit measures forming the combined criterion are displayed in Table 21 in the Appendix. The χ2 test for MG-CCFA was remarkably sensitive to differences in thresholds and outperformed all the other approaches, regardless of the other simulated conditions. In addition, LoR's TPR was lower than that of MG-CCFA and the LRT in almost all conditions, especially when the sample size was large. However, in the case of large misaligned shifts, its TPR was almost always the same as that of MG-CCFA and the LRT.

Table 7 Thresholds’ TPR scale level - non-invariance scenario

Item-level performance

The results of the TPR when testing loadings equivalence at the item level in the non-invariance scenario are displayed in Table 8; the corresponding FPRs are displayed in Table 18 in the Appendix. The χ2 test generally resulted in a higher TPR than the other approaches. However, for this test, the FPR was generally above .1, especially in conditions with large sample sizes; we marked these TPRs with an asterisk to indicate that they should be interpreted with caution. Similar to the scale-level results, all testing strategies hardly detected non-invariance when small changes in the loadings were simulated for short scales, reaching a maximum TPR of .267 in the condition with misaligned changes affecting 40% of the items, N = 1000, and C = 5. For the combined criterion, difficulties in flagging non-invariant items were even more pronounced in the conditions with long scales, with loadings non-equivalence going undetected in most cases. The performance of each of the fit measures forming this criterion for MG-CCFA was therefore further investigated; these results are displayed in Table 22 in the Appendix. For both ΔRMSEA and ΔCFI, when small loading changes were simulated, the results seemed to depend heavily on the length of the scale: for long scales, both measures rarely detected changes in loadings. For the MG-IRT-based approaches, differences in loadings were rarely detected by the LoR approach regardless of the condition, and even less frequently as the sample size increased. The LRT outperformed LoR in all conditions in terms of the TPR.

Table 8 Loadings’ TPR item level - non-invariance scenario
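For readers less familiar with the LoR approach, the hypothetical sketch below illustrates its general logic for a single item (it is not the implementation used in this study): an ordinal logistic regression of the item on a trait proxy is compared with a model that additionally includes group and a trait-by-group interaction, and the decision combines a likelihood-ratio test with the change in a pseudo-R2 measure. The .02 cut-off for the pseudo-R2 change is illustrative.

```python
# Hypothetical sketch of a logistic-ordinal-regression (LoR) DIF check
# for a single 3-category item, combining a likelihood-ratio test with
# the change in McFadden's pseudo-R^2. Data and cut-offs are illustrative.
import numpy as np
from scipy.stats import chi2
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(7)
n = 500
group = rng.integers(0, 2, n)                 # two groups
trait = rng.normal(size=n)                    # e.g., rest score or factor score
latent = 1.2 * trait + 0.6 * group + rng.logistic(size=n)  # item with a group effect (DIF)
item = np.digitize(latent, bins=[-0.5, 1.0])  # observed scores 0, 1, 2

X_reduced = np.column_stack([trait])
X_full = np.column_stack([trait, group, trait * group])
res_reduced = OrderedModel(item, X_reduced, distr="logit").fit(method="bfgs", disp=False)
res_full = OrderedModel(item, X_full, distr="logit").fit(method="bfgs", disp=False)

# Likelihood-ratio test for the two extra parameters (uniform + non-uniform DIF).
lrt = 2 * (res_full.llf - res_reduced.llf)
p_value = chi2.sf(lrt, df=2)

# McFadden pseudo-R^2 relative to a thresholds-only (null) model, whose
# log-likelihood equals the multinomial log-likelihood of the margins.
counts = np.bincount(item)
ll_null = np.sum(counts * np.log(counts / n))
delta_r2 = (1 - res_full.llf / ll_null) - (1 - res_reduced.llf / ll_null)

flag = (p_value < 0.05) and (delta_r2 > 0.02)   # combined LRT + pseudo-R^2 rule
print(round(lrt, 2), round(p_value, 4), round(delta_r2, 4), flag)
```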

The results of the TPR when testing thresholds equivalence at the item level in the non-invariance scenario are displayed in Table 9; the corresponding FPRs are displayed in Table 19 in the Appendix. The χ2 test for MG-CCFA generally outperformed all the remaining approaches, regardless of the other factors. Moreover, small differences in thresholds in the conditions with N = 1000 were rarely (or never) detected by the MG-CCFA-based combined criterion. Again, we inspected the TPR for each of the MG-CCFA-based fit measures that formed this criterion; the results are displayed in Table 23 in the Appendix. The ΔRMSEA and ΔCFI TPRs were heavily affected by the length of the scale, and both measures rarely flagged non-invariant items, especially in the conditions where small threshold differences were simulated.

Table 9 Thresholds’ TPR item level - non-invariance scenario

Conclusions

Based on the results observed in the invariance scenario, we can conclude that only some of the MG-CCFA- and MG-IRT-based testing strategies yielded an FPR at or below the chosen α level; many of the considered strategies for flagging a scale/item as non-invariant had an inflated Type I error rate. For the MG-CCFA-based criteria, the FPR was often at or below the chosen α level for the χ2 test or when a combination of the χ2 test and an alternative fit measure (e.g., RMSEA or CFI) was used. For the MG-IRT-based approaches, the LRT provided a well-controlled FPR in all conditions, regardless of whether the test was conducted at the scale or at the item level. The LoR approach for MG-IRT showed an inflated FPR when only the LRT criterion was used, whereas combining the LRT criterion with a pseudo-R2 measure resulted in a low FPR in (almost) all conditions.

Based on the results observed in the non-invariance scenario, we can conclude that, when testing loadings equivalence, small changes in loadings are hard to detect regardless of whether the test is performed at the scale or at the item level. Furthermore, the χ2 test generally outperformed the MG-IRT-based testing strategies when loadings non-invariance was tested at the scale level, whereas the LRT outperformed the MG-CCFA-based testing strategies and LoR when loadings non-invariance was tested at the item level. In fact, while the item-level χ2 test was more sensitive than the item-level LRT to changes in loadings, the FPR for the χ2 test was generally above the nominal α level, and especially high in conditions with large sample sizes. The latter result is in line with previous literature, which suggested that the item-level LRT outperforms MG-CCFA-based approaches when considering both the TPR and the FPR (Kim & Yoon, 2011). Therefore, in empirical practice, the item-level LRT might be preferred if one aims to test loading equivalence for each item separately. When testing threshold equivalence, in contrast, the χ2 test outperformed all the other testing strategies both at the scale and at the item level. Furthermore, in the non-invariance scenario, a combined criterion was used for MG-CCFA to flag scales/items as non-invariant, and we further inspected the TPRs for each of the measures that form this criterion. These results, for the scale- and item-level tests, are displayed in Tables 20 through 23 in the Appendix. In particular, the TPRs for ΔRMSEA and ΔCFI were heavily affected by both scale length and the level at which MI was tested (scale or item): for long scales, these two measures hardly detected changes in loadings and thresholds, especially when the test was conducted at the item level. This result is especially relevant in empirical practice, where researchers commonly base MI decisions on multiple criteria (Putnick & Bornstein, 2016). Based on our results, we would discourage researchers from using either of these fit measures, in particular when testing MI for each item individually.

Discussion

When comparing psychological constructs across groups, testing for measurement invariance (MI) plays a crucial role. With ordinal data, multiple group categorical confirmatory factor analysis (MG-CCFA) and multiple group item response theory (MG-IRT) models can be made equivalent using a set of minimal identification constraints (Chang et al., 2017). Still, differences between the two approaches remain in the context of MI testing. These differences are reflected in (i) the hypotheses being tested, and (ii) the testing strategies/measures used to test these hypotheses. In this paper, two simulation studies were conducted to evaluate the performance of the different testing strategies and measures for testing MI when (i) the test is conducted at the scale or at the item level, and (ii) MG-CCFA- or MG-IRT-based testing strategies are used. In the first simulation study, an invariance scenario was simulated in which no differences existed in the parameters across groups. In the second simulation study, the performance of these approaches was assessed when non-invariance was simulated between groups.

A key result of these simulation studies is that MG-CCFA-based testing strategies are generally better than MG-IRT-based ones when testing for MI at the scale level. Therefore, in empirical practice, we recommend using either the χ2 test or a combination of the χ2 test and an alternative fit measure (i.e., RMSEA or CFI) when testing MI at the scale level. When testing MI at the item level, the χ2 test performed better than the MG-IRT-based approaches for threshold equivalence, whereas for loading equivalence the item-level LRT provided the best trade-off between correctly and incorrectly flagged non-invariant items.

Another key result pertains to how the length of a scale and the level at which MI is tested affect the performance of MG-CCFA’s fit measures. Both RMSEA and CFI hardly detected non-invariant parameters when MI was tested for each item individually, especially with long scales. That is, the more items in a scale, the harder it is for these measures to detect whether a specific item is non-invariant. These results point to a fundamental issue when using these fit measures to test MI at the item level: the commonly used cut-off values seem inadequate for item-level testing, since their performance depends heavily on the length of the scale. Commonly, MG-CCFA is used to test for MI at the scale level, which might explain why most papers have focused on defining optimal cut-off values for these measures when MI is tested at this level (Cheung & Rensvold, 2002; Chen, 2007; Rutkowski & Svetina, 2014, 2017). If non-invariance is detected, researchers might decide to inspect its source by conducting a test for each item individually (Kim & Yoon, 2011; Putnick & Bornstein, 2016). Based on our results, we would discourage researchers from using these measures for this purpose until the cut-off values have been re-evaluated for item-level testing in future research. In this sense, dynamic procedures for determining fit-index cut-off values, in which appropriate cut-off values are derived for a specific model (McNeish & Wolf, 2020), are a promising solution, and it is especially important to extend and evaluate these procedures for MI testing with ordered-categorical data. Finally, modification indices might help to obtain indications of whether and where DIF exists; however, the performance of such tools in identifying non-invariant items remains unclear and requires further research.
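As a purely conceptual illustration of such dynamic procedures (and not of McNeish and Wolf’s implementation), a model-specific cut-off for, say, ΔCFI could be obtained by simulating many data sets from the fitted invariant model and taking an upper quantile of the ΔCFI values observed under this null; the helper functions passed to the sketch below are hypothetical placeholders.

```python
# Conceptual sketch of a "dynamic" cut-off for Delta-CFI (not McNeish &
# Wolf's implementation): simulate many data sets from the fitted
# invariant model, record the Delta-CFI obtained under this null, and
# use an upper quantile as the model-specific cut-off. Both callables
# are hypothetical placeholders: simulate_data(rng) should generate one
# data set under invariance, and fit_invariance_models(data) should
# refit the constrained and unconstrained models and return their CFI
# difference.
import numpy as np

def dynamic_delta_cfi_cutoff(simulate_data, fit_invariance_models,
                             n_replications=500, quantile=0.95, seed=0):
    rng = np.random.default_rng(seed)
    null_deltas = [fit_invariance_models(simulate_data(rng))
                   for _ in range(n_replications)]
    return float(np.quantile(null_deltas, quantile))
```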

The simulation studies conducted here provide useful indications of the performance of testing strategies and measures for testing MI in models for ordinal data. Still, they are not free of limitations, and it is worth highlighting some of them. An important limitation of our work has to do with the assumptions made by the different measurement models. While the imposed constraints and testing steps we followed can be considered standard, using these constraints may prevent a more fine-grained analysis of MI. Specifically, to validly compare MG-CCFA- and MG-IRT-based approaches, it was crucial that MI was tested using an equivalent measurement model, which was specified using the set of constraints proposed by Chang et al., (2017). These constraints can be seen as MG-GRM-type constraints, in which both the unique variances and the intercepts are constrained to be equal across groups. Imposing such equalities, which is commonly done in MG-IRT-based approaches, could be limiting if the goal is a more fine-grained analysis of MI. MG-CCFA-based constraints may be better suited to separately unravel differences in unique variances and intercepts across groups, and Wu and Estabrook (2016) have shown that, within the MG-CCFA framework, it may be preferable to select identification constraints based on which parameters are tested for non-invariance in order to avoid model misspecification. Specifically, these authors showed that, for MG-CCFA, constraints that are commonly imposed on a baseline model (i.e., the configural model, in which the same number of factors and the same loading structure are imposed across groups) can become restrictions when new invariance constraints (e.g., constraining all loadings to be equal) are added. As a consequence, it may be preferable to define the baseline model on a case-by-case basis, depending on the type of invariance tested (e.g., thresholds invariance). Therefore, we strongly recommend that researchers carefully evaluate the suitability of the restrictions underlying classical MG-CCFA- and MG-IRT-based procedures, such as the ones presented here, before testing for MI.
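For reference, under a probit link and the delta parameterization (unit latent response variances), the standard mapping between the two parameterizations can be sketched as

$$
P(Y_{ij} \geq k \mid \eta_i)
= \Phi\!\left(\frac{\lambda_j \eta_i - \tau_{jk}}{\sqrt{1-\lambda_j^{2}}}\right)
= \Phi\!\left(a_j\,(\eta_i - b_{jk})\right),
\qquad
a_j = \frac{\lambda_j}{\sqrt{1-\lambda_j^{2}}},
\qquad
b_{jk} = \frac{\tau_{jk}}{\lambda_j},
$$

where λ_j and τ_jk denote the MG-CCFA loading and thresholds of item j, and a_j and b_jk the corresponding normal-ogive discrimination and category-boundary parameters (a logistic formulation differs only by the usual scaling constant of roughly 1.7). Under the MG-GRM-type constraints described above, this mapping is applied with the unique variances 1 − λ_j² and the intercepts held equal across groups.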

Another important set of limitations pertains to the dimensionality of the simulated scales and the absence of residual covariances. In particular, we focused on unidimensional scales, whereas researchers are frequently confronted with scales that capture multiple dimensions. Generally, MG-CCFA is used for multidimensional constructs, while MG-IRT-based models are preferred for unidimensional constructs. It might therefore be interesting to inspect whether results similar to the ones observed here would be obtained when model complexity is increased to multiple dimensions. In addition, the data-generating models did not include any residual covariances among items, which are not uncommon in empirical practice (MacCallum & Tucker, 1991). Ignoring such residual covariances by assuming uncorrelated errors can affect MI testing for continuous data (Joo & Kim, 2019), but further research is needed to assess how residual covariances affect MI testing for ordered-categorical data.

Another set of limitations pertains to the grouping. First, in the current simulation studies, we inspected the performance of MG-CCFA- and MG-IRT-based testing strategies with only two groups. However, cross-cultural and cross-national data, in which many groups are compared simultaneously, are becoming increasingly common in the psychological sciences. For this reason, it might be useful to investigate differences in the performance of the studied approaches when many groups are compared. Second, in these simulation studies, we knew which subject belonged to which group, and differences were created between the groups’ measurement models. However, the grouping of subjects is not always known, and/or researchers might not have access to the variables that are thought to cause heterogeneity (e.g., nationality, gender). In this case, a different approach might be preferred to disentangle the heterogeneity across participants (e.g., factor mixture models; Lubke, 2005).

One last important set of limitations concerns the anchoring of the scale, that is, which items’ parameters are set equal across groups in order to identify the model and to make the scale comparable across groups. First, the item that was used as the anchor in the simulation studies was known to be invariant across groups. In real applications, this information is never known beforehand, and estimating a model that relies on an inadequate anchor item could affect the model’s convergence as well as the ability to detect non-invariant parameters. This issue has been partly discussed in previous studies comparing different types of identification constraints (Chang et al., 2017). It could be interesting to inspect, in a more comprehensive study, how the choice of a “good” or “bad” anchor item influences the detection of MI. Second, in these simulation studies, a set of minimal constraints was used to make the measurement models equivalent, and only one item was constrained to be equal across groups. Minimal constraints allow most parameters to be freely estimated. However, when specific items are known to function similarly across groups (e.g., based on prior studies or strong theoretical reasons to consider them invariant), it might be beneficial, both for estimation and for the power to detect non-invariance in the model’s parameters, to constrain them to be equal across groups. Such choices are particularly relevant, and various approaches exist to determine which item(s) should be used as anchor(s), both in MG-CCFA (French & Finch, 2008) and in MG-IRT (Candell & Drasgow, 1988; Wainer & Braun, 1988; Clauser et al., 1993; Khalid & Glas, 2014).

Open practices

The data and the analysis scripts are freely available and have been posted at https://osf.io/u9y8m/.